Practical Solr: January 2017

Document age scoring takes the freshness of a document into account when ranking results. As a general rule, the newer the document, the more relevant it is. Consider a news site: recent stories, say from the last day or so, are of much more interest for a generic headline search. Archives are still valuable for advanced searchers doing research.

The example below considers a set of documents relating to company reports. Search is on the company Name.

No timestamp boost

These results have no timestamp boosting applied. In the case of identical results (Universal Tool Inc) the document id is used as a tiebreaker to rank documents. Lower document ids rank higher. This means that older documents are shown first.

Name	Score	Timestamp
UNIVERSAL FISHER LLC	2.3220387	2015-12-02T02:18:24Z
UNIVERSAL TOOL INC	2.3220387	2016-11-05T05:06:16Z
UNIVERSAL TOOL INC	2.3220387	2016-11-05T05:06:22Z
UNIVERSAL TOOL INC.	2.3220387	2016-12-01T09:17:47Z
UNIVERSAL TOOL INC.	2.3220387	2016-11-30T02:24:15Z
UNIVERSAL MICRO BUSINESS SOLUTIONS	2.0317838	2016-11-22T02:13:13Z
UNIVERSAL WEATHER AND DATA SYSTEMS	1.741529	2016-11-22T01:54:06Z
UNIVERSAL WEATHER AND DATA SYSTEMS	1.741529	2016-11-28T22:31:49Z
CNC UNIVERSAL	1.6880591	2016-10-06T17:24:22Z
CNC UNIVERSAL	1.6880591	2016-11-11T17:09:15Z

Timestamp boost with exponential decay

The recip (reciprocal) function is frequently used to account for the freshness of documents. It can be tested in the Solr admin console by setting the “bf” (boost function) parameter with the edismax query parser. Newer documents should score higher than older documents.

bf=recip(ms(NOW/HOUR,timestamp),3.16e-11,0.08,0.05)
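As a rough sketch of how the same boost could be sent from a client rather than the admin console, the Python snippet below builds the URL for an edismax request. The host, the core name "companies" and the field name "name" are assumptions made for illustration; only the bf expression comes from the example above.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/companies/select"


def build_boosted_query(user_query, date_field="timestamp"):
    """Return a select URL that adds a recip() freshness boost through bf."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": "name",  # assumed name of the searched field
        "fl": "name,score," + date_field,
        # Same function as in the text: a=0.08, b=0.05, m=3.16e-11 (1 year).
        "bf": "recip(ms(NOW/HOUR,%s),3.16e-11,0.08,0.05)" % date_field,
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_boosted_query("universal"))
```

urlencode takes care of escaping the parentheses, commas and slash inside the function, which is easy to get wrong when the URL is typed by hand.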

This is sometimes called a “fuzzy ordering”. It is just one of a number of factors taken into account for document ranking, as can be seen in the results below.

Here are the complete results. For results with matching names the ordering is now by date, newest first, but a new document does not automatically rank highest: the result Universal E-Business Solutions scores lower despite being a new document because its field length is longer. The document CNC Universal has also made its way into the top ten.

Name	Score	Timestamp
UNIVERSAL TOOL INC	2.3235977	2016-12-01T09:17:47Z
UNIVERSAL TOOL INC	2.3235607	2016-11-30T02:24:15Z
UNIVERSAL TOOL INC	2.3230824	2016-11-05T05:06:16Z
UNIVERSAL TOOL INC	2.3230824	2016-11-05T05:06:22Z
UNIVERSAL FISHER LLC	2.3222296	2015-12-02T02:18:24Z
UNIVERSAL MICRO BUSINESS SOLUTIONS	2.0331118	2016-11-22T02:13:13Z
UNIVERSAL WEATHER AND DATA SYSTEMS	1.7430217	2016-11-28T22:31:49Z
UNIVERSAL WEATHER AND DATA SYSTEMS	1.7428579	2016-11-22T01:54:06Z
CNC UNIVERSAL	1.6896107	2016-12-01T01:04:48Z
CNC Universal	1.6895752	2016-11-29T18:52:13Z

For reference, the function:

recip(x, m, a, b) implements f(x) = a/(xm+b)

Where:

x : the document age in ms, defined as ms(NOW,<datefield>). In our example NOW is rounded down to the start of the current hour (NOW/HOUR).

m : a constant that defines the time scale over which the boost is applied. It should be the inverse of what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) implies using the inverse: 3.16e-11 (1/3.16e10, rounded). For a news site, a document that is 7 days old could already be considered stale, so a much shorter reference_time would be appropriate.

a and b are constants (defined arbitrarily).

xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).

xm ≈ 0 when the document is new, resulting in a value close to a/b.

Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.

With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
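The behaviour described above can be checked with a few lines of Python. This is only a sketch that mirrors the formula a/(xm+b); the helper name recip and the constant MS_PER_YEAR are chosen here for illustration and are not part of Solr.

```python
def recip(x, m, a, b):
    """Mirror of the formula behind Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 3.16e10  # reference_time of one year, as used in the text
m = 1 / MS_PER_YEAR

for years in (0, 1, 2):
    age_ms = years * MS_PER_YEAR
    with_equal_constants = recip(age_ms, m, 1, 1)       # a = b = 1
    with_example_values = recip(age_ms, m, 0.08, 0.05)  # a and b from the bf example
    print(years, round(with_equal_constants, 4), round(with_example_values, 4))
```

With a = b = 1 the value never exceeds 1, whereas with a = 0.08 and b = 0.05 a brand new document gets a/b = 1.6.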

How to make date boosting stronger?

Increase m: choose a shorter reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1 year reference, the multiplier decreases twice as fast as the document age increases.

Decreasing a and b expands the response curve of the function. This can be very aggressive. See the graph in this reference:

ref: http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr
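To get a feel for how the choice of reference_time changes the boost, the sketch below derives m from a reference_time and evaluates the multiplier for one fixed document age. The helper names and the 90 day sample age are assumptions for illustration; the 1 year and 6 month values are the ones used in the text.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def m_for_reference(reference_ms):
    """Return the recip() constant m for a reference_time in milliseconds."""
    return 1.0 / reference_ms


def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


age_ms = 90 * MS_PER_DAY  # a sample document about three months old

references = (
    ("1 year", 3.16e10),
    ("6 months", 1.58e10),
    ("7 days", 7 * MS_PER_DAY),
)
for label, reference_ms in references:
    m = m_for_reference(reference_ms)
    print(label, "m=%.3g" % m, "multiplier=%.4f" % recip(age_ms, m, 1, 1))
```

The shorter the reference_time, the larger m becomes and the faster the multiplier drops for the same document age.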

bf scoring

“bf” is an additive boost. It is independent of the global score (relevancy). Because the same amount is added whatever the relevancy score, its relative weight on relevant results (high scores) is less than on poorer matches (low scores).

This is the scoring without timestamp boosting. The overall score is the product of the queryWeight and the fieldWeight:

2.3220387 = score(doc=1704,freq=1.0), product of:
  0.4467017 = queryWeight, product of:
    15.0 = boost
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.0028644716 = queryNorm
  5.1981864 = fieldWeight in 1704, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.5 = fieldNorm(doc=1704)
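The arithmetic in this explain output can be reproduced by hand. The sketch below simply multiplies the figures shown above; the variable names are chosen here for readability and are not Solr identifiers.

```python
# Values copied from the explain output for doc 1704.
boost = 15.0
idf = 1.5321448 + 8.864228  # sum of the idf values of the two query terms
query_norm = 0.0028644716
tf = 1.0
field_norm = 0.5

query_weight = boost * idf * query_norm  # about 0.4467
field_weight = tf * idf * field_norm     # about 5.1982
score = query_weight * field_weight      # about 2.3220

print(query_weight, field_weight, score)
```

Small differences in the last digits are expected because Solr computes these values with single precision floats.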

This example uses timestamp boosts. Note that the queryNorm value changes. This is an attempt by Solr to normalize the scores of the two different queries so that they can be compared. It is not generally considered to be accurate. The overall score is the sum of the text score and the boost: (0.44669986 * 5.1981864) + 0.0015685626

2.323597699 = score(doc=7834,freq=1.0), product of:
  0.44669986 = queryWeight, product of:
    15.0 = boost
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.0028644598 = queryNorm
  5.1981864 = fieldWeight in 7834, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.5 = fieldNorm(doc=7834)
0.0015685626 = product of:
  0.54759455 = 0.08/(3.16E-11*float(ms(const(1483624800000),date(timestamp)=2016-12-01T09:17:47Z))+0.05)
  1.0 = boost
  0.0028644598 = queryNorm
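As a check on the boosted score, the sketch below recomputes the function value from the two timestamps in the explain output and adds it to the text score. The helper name recip and the variable names are illustrative; the numbers are the ones shown above.

```python
from datetime import datetime, timezone


def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


now_ms = 1483624800000  # const() value in the explain output (NOW/HOUR)
doc_time = datetime(2016, 12, 1, 9, 17, 47, tzinfo=timezone.utc)
age_ms = now_ms - int(doc_time.timestamp() * 1000)

function_value = recip(age_ms, 3.16e-11, 0.08, 0.05)  # about 0.5476
query_norm = 0.0028644598
boost_part = function_value * 1.0 * query_norm        # bf contribution
text_part = 0.44669986 * 5.1981864                    # queryWeight * fieldWeight

print(age_ms, function_value, boost_part, text_part + boost_part)
```

The bf contribution is roughly 0.0016 on a text score of about 2.32, which is why the boost only reorders documents whose text scores are otherwise identical or very close.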

TF-IDF

How do we score documents? The simplest way would be to count how many times a word (term) occurs in a document. The higher the value, the more relevant the document. This is called the Term Frequency or TF. However, certain words are very common. Take the phrase:

Neanderthal man

Man is one of the most frequently occurring words in the English language. If we ranked just on term frequency, documents featuring only the word “man” would dominate those containing the second, rarer term “Neanderthal”. To identify important terms it is useful to calculate how frequently they occur in a document set. This is referred to as the Document Frequency or DF. If we divide TF by DF we get a measure of how important a term is as a keyword for that topic. This is referred to as TF-IDF, or TF multiplied by 1/DF.

In our example we have 5 documents. Man occurs in 4 of them, Neanderthal in 2:

Doc	Neanderthal	Man	Score
1	TF 4, Score 4/2=2	TF 4, Score 4/4=1	3
2	TF 0, Score 0	TF 5, Score 5/4=1.25	1.25
3	TF 2, Score 2/2=1	TF 2, Score 2/4=0.5	1.5
4	TF 0, Score 0	TF 0, Score 0	0
5	TF 0, Score 0	TF 2, Score 2/4=0.5	0.5

We can see that the common term Man has its score reduced to a certain extent. So (in this simple example) the documents containing both terms, 1 and 3, score higher. Without including the DF, document 2 would rank higher than document 3.
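The toy example above can be reproduced with a short Python sketch. It uses the simplified TF divided by DF score from the text, not Lucene's actual similarity formula, and the data structures are assumptions made for illustration.

```python
# Term frequencies per document, taken from the table in the text.
term_freqs = {
    1: {"neanderthal": 4, "man": 4},
    2: {"neanderthal": 0, "man": 5},
    3: {"neanderthal": 2, "man": 2},
    4: {"neanderthal": 0, "man": 0},
    5: {"neanderthal": 0, "man": 2},
}
# Number of documents each term occurs in, as stated in the text.
doc_freqs = {"neanderthal": 2, "man": 4}


def tf_only(doc):
    """Score a document by raw term frequency alone."""
    return sum(term_freqs[doc].values())


def tf_over_df(doc):
    """Score a document by TF divided by DF, summed over the query terms."""
    return sum(tf / doc_freqs[term] for term, tf in term_freqs[doc].items())


ranked_by_tf = sorted(term_freqs, key=tf_only, reverse=True)
ranked_by_tf_df = sorted(term_freqs, key=tf_over_df, reverse=True)

print("TF only:", ranked_by_tf)
print("TF/DF:  ", ranked_by_tf_df)
```

Dividing by DF moves document 3, which contains both terms, above document 2, which only contains the common term.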

Note that from Solr 6 a different algorithm, BM25 (Best Match 25), is used by default, but it is still based on TF-IDF.

Practical Solr

Thursday, January 5, 2017

Document Age Scoring

Tuesday, January 3, 2017

What is in a Lucene index?
