Practical Solr: TF-IDF

How do we score documents? Well, the simplest way would be to count how many times a word (term) occurs in a document. The higher the value, the more relevant the document. This is called the Term Frequency or TF. However, certain words are very common. Take the phrase:

Neanderthal man

“Man” is one of the most frequently occurring words in the English language. If we ranked just on term frequency, documents featuring only the word “man” would dominate those featuring the rarer second term, “Neanderthal”. To identify important terms it is useful to calculate how frequently they occur across a document set, that is, how many documents contain the term. This is referred to as the Document Frequency or DF. If we divide TF by DF we get a measure of how important a term is as a keyword for that topic. This is referred to as TF-IDF, where IDF is the Inverse Document Frequency: TF multiplied by 1/DF.
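To make the two counts concrete, here is a minimal sketch in Python. The three one-line documents are made up for illustration and are separate from the five-document example; the function names (term_frequencies, document_frequencies, simple_score) are hypothetical helpers, not anything from Solr, and splitting on whitespace stands in for a real analyzer.

```python
from collections import Counter


def term_frequencies(text):
    """TF: how often each lower-cased, whitespace-separated term occurs in one document."""
    return Counter(text.lower().split())


def document_frequencies(docs):
    """DF: for each term, the number of documents that contain it at least once."""
    df = Counter()
    for text in docs:
        # set() ensures a term is counted once per document, however often it repeats.
        df.update(set(text.lower().split()))
    return df


def simple_score(term, text, df):
    """TF divided by DF, the simplified weighting described above."""
    if df[term] == 0:
        return 0.0
    return term_frequencies(text)[term] / df[term]


# Made-up documents, for illustration only.
docs = [
    "neanderthal man and modern man",
    "the man walked home",
    "a neanderthal skull",
]
df = document_frequencies(docs)
print(df["man"], df["neanderthal"])
print(simple_score("man", docs[0], df))
```

Note that DF counts documents, not occurrences: “man” appears twice in the first document but only adds one to its DF.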

Taking our example, suppose we have 5 documents. “Man” occurs in 4 of them (DF 4) and “Neanderthal” in 2 (DF 2):

Doc	Neanderthal (DF 2)	Man (DF 4)	Total score
1	TF 4, Score 4/2=2	TF 4, Score 4/4=1	3
2	TF 0, Score 0	TF 5, Score 5/4=1.25	1.25
3	TF 2, Score 2/2=1	TF 2, Score 2/4=0.5	1.5
4	TF 0, Score 0	TF 0, Score 0	0
5	TF 0, Score 0	TF 2, Score 2/4=0.5	0.5

We can see that the common term “Man” has its score reduced to a certain extent. So (in this simple example) the documents containing both terms, 1 and 3, score highest. Without the DF, document 2 (total TF of 5) would rank higher than document 3 (total TF of 4).
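The ranking can be checked with a short Python sketch. The TF counts are copied from the table; the variable and function names are hypothetical, and the score is the simplified TF/DF used in this post, not what Solr actually computes.

```python
# Term counts per document, copied from the example table.
tf = {
    1: {"neanderthal": 4, "man": 4},
    2: {"neanderthal": 0, "man": 5},
    3: {"neanderthal": 2, "man": 2},
    4: {"neanderthal": 0, "man": 0},
    5: {"neanderthal": 0, "man": 2},
}
terms = ["neanderthal", "man"]

# DF: number of documents in which each term occurs at least once.
df = {t: sum(1 for counts in tf.values() if counts[t] > 0) for t in terms}


def tf_only(doc_id):
    """Score a document by summing raw term frequencies."""
    return sum(tf[doc_id][t] for t in terms)


def tf_over_df(doc_id):
    """Score a document by summing TF/DF over the query terms."""
    return sum(tf[doc_id][t] / df[t] for t in terms)


# Rank document ids from highest to lowest score under each scheme.
rank_tf = sorted(tf, key=tf_only, reverse=True)
rank_tf_df = sorted(tf, key=tf_over_df, reverse=True)

for doc_id in tf:
    print(doc_id, tf_only(doc_id), tf_over_df(doc_id))
print("TF only:", rank_tf)
print("TF/DF:  ", rank_tf_df)
```

With TF alone, document 2 comes ahead of document 3; dividing by DF swaps them, because document 3's two occurrences of the rarer term count for more than document 2's extra occurrences of the common one.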

Note that from Solr 6 a different algorithm is used by default, BM25 (Best Match 25), but this is still based on TF-IDF.

Tuesday, January 3, 2017
