How do we score documents? The simplest way is to count how many times a word (term) occurs in a document: the higher the count, the more relevant the document. This is called the Term Frequency, or TF. However, certain words are very common. Take the phrase:
Neanderthal man
“Man” is one of the most frequently occurring words in the English language. If we ranked on term frequency alone, documents featuring only the word “man” would dominate, drowning out the rarer and more significant second term “Neanderthal”. To identify important terms, it is useful to calculate how frequently they occur across the document set, that is, the number of documents in which the term appears. This is referred to as the Document Frequency, or DF. If we divide TF by DF we get a measure of how important a term is as a keyword for that topic. This is referred to as TF-IDF (term frequency-inverse document frequency), or TF multiplied by 1/DF.
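As a rough illustration of the definitions above, here is a minimal Python sketch that counts TF and DF and divides one by the other. The three-document corpus and the function names are invented for this example, and the scoring is the simplified TF × 1/DF described here, not Solr's actual implementation.

```python
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
docs = [
    "neanderthal man was a man",
    "the man saw another man",
    "a man walks",
]

def term_frequencies(text):
    """TF: how many times each term occurs in a single document."""
    return Counter(text.lower().split())

def document_frequencies(corpus):
    """DF: the number of documents in which each term occurs at least once."""
    df = Counter()
    for text in corpus:
        df.update(set(text.lower().split()))
    return df

def score(text, term, df):
    """Simplified TF-IDF as described above: TF multiplied by 1/DF."""
    if df[term] == 0:
        return 0.0
    return term_frequencies(text)[term] / df[term]

df = document_frequencies(docs)
print("DF man:", df["man"], "DF neanderthal:", df["neanderthal"])
print("man:", score(docs[0], "man", df))
print("neanderthal:", score(docs[0], "neanderthal", df))
```

In the first document “man” occurs twice and “neanderthal” once, yet the rarer term ends up with the higher score because its DF is lower.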
Taking our example, suppose we have 5 documents. Man occurs in 4 of them, Neanderthal in 2:
| Doc | Term: Neanderthal | Term: Man | Score |
|---|---|---|---|
| 1 | TF 4, Score 4/2=2 | TF 4, Score 4/4=1 | 3 |
| 2 | TF 0, Score 0 | TF 5, Score 5/4=1.25 | 1.25 |
| 3 | TF 2, Score 2/2=1 | TF 2, Score 2/4=0.5 | 1.5 |
| 4 | TF 0, Score 0 | TF 0, Score 0 | 0 |
| 5 | TF 0, Score 0 | TF 2, Score 2/4=0.5 | 0.5 |
We can see that the common term Man has its score reduced to a certain extent. So (in this simple example) the documents containing both terms, 1 and 3, score higher. Without including the DF, document 2 (total TF of 5) would rank higher than document 3 (total TF of 4).
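The ranking claim can be checked with a short Python sketch. The TF and DF values are copied from the example above; the dictionary layout and function names are assumptions made for this illustration.

```python
# TF per document and term, taken from the worked example.
tf = {
    1: {"neanderthal": 4, "man": 4},
    2: {"neanderthal": 0, "man": 5},
    3: {"neanderthal": 2, "man": 2},
    4: {"neanderthal": 0, "man": 0},
    5: {"neanderthal": 0, "man": 2},
}

# DF: Neanderthal occurs in 2 documents, Man in 4.
df = {"neanderthal": 2, "man": 4}

def tf_only(doc):
    """Score a document by summing raw term frequencies."""
    return sum(tf[doc].values())

def tf_idf(doc):
    """Score a document by summing TF / DF over the query terms."""
    return sum(count / df[term] for term, count in tf[doc].items())

rank_tf = sorted(tf, key=tf_only, reverse=True)
rank_tfidf = sorted(tf, key=tf_idf, reverse=True)

print("TF only:", rank_tf)      # [1, 2, 3, 5, 4]
print("TF / DF:", rank_tfidf)   # [1, 3, 2, 5, 4]
```

Dividing by DF is what moves document 3 above document 2; the rest of the ordering is unchanged.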
Note: from Solr 6 a different algorithm, BM25 (Best Match 25), is used by default, but it is still based on TF-IDF.