Document age scoring takes the freshness of a document into account when ranking results. As a general rule, the newer the document, the more relevant it is. Consider a news site: recent stories, say from the last day or so, are of much more interest for a generic headline search. Archives, of course, remain useful to advanced searchers doing research.
Here are the complete results. Matching results are ordered by date. The result Universal E-Business Solutions scores lower despite being a new document because its field length is longer. The document CNC Universal has also made its way into the top ten.
The example below considers a set of documents relating to company reports. Search is on the company Name.
No timestamp boost
These results have no timestamp boosting applied. When results have identical scores (Universal Tool Inc), the document id is used as a tiebreaker: lower document ids rank higher. This means that older documents are shown first.
| Name | Score | Timestamp |
|---|---|---|
| UNIVERSAL FISHER LLC | 2.3220387 | 2015-12-02T02:18:24Z |
| UNIVERSAL TOOL INC | 2.3220387 | 2016-11-05T05:06:16Z |
| UNIVERSAL TOOL INC | 2.3220387 | 2016-11-05T05:06:22Z |
| UNIVERSAL TOOL INC. | 2.3220387 | 2016-12-01T09:17:47Z |
| UNIVERSAL TOOL INC. | 2.3220387 | 2016-11-30T02:24:15Z |
| UNIVERSAL MICRO BUSINESS SOLUTIONS | 2.0317838 | 2016-11-22T02:13:13Z |
| UNIVERSAL WEATHER AND DATA SYSTEMS | 1.741529 | 2016-11-22T01:54:06Z |
| UNIVERSAL WEATHER AND DATA SYSTEMS | 1.741529 | 2016-11-28T22:31:49Z |
| CNC UNIVERSAL | 1.6880591 | 2016-10-06T17:24:22Z |
| CNC UNIVERSAL | 1.6880591 | 2016-11-11T17:09:15Z |
Timestamp boost with reciprocal decay
The recip function is frequently used when considering the freshness of documents. It can be tested in the Solr admin console using the “bf” (boost function) field with the edismax query parser. Newer documents should score higher than older documents.
bf=recip(ms(NOW/HOUR,timestamp),3.16e-11,0.08,0.05)
This is sometimes called a “fuzzy ordering”. It is just one of a number of factors taken into account for document ranking, as can be seen in the results below.
| Name | Score | Timestamp |
|---|---|---|
| UNIVERSAL TOOL INC | 2.3235977 | 2016-12-01T09:17:47Z |
| UNIVERSAL TOOL INC | 2.3235607 | 2016-11-30T02:24:15Z |
| UNIVERSAL TOOL INC | 2.3230824 | 2016-11-05T05:06:16Z |
| UNIVERSAL TOOL INC | 2.3230824 | 2016-11-05T05:06:22Z |
| UNIVERSAL FISHER LLC | 2.3222296 | 2015-12-02T02:18:24Z |
| UNIVERSAL MICRO BUSINESS SOLUTIONS | 2.0331118 | 2016-11-22T02:13:13Z |
| UNIVERSAL WEATHER AND DATA SYSTEMS | 1.7430217 | 2016-11-28T22:31:49Z |
| UNIVERSAL WEATHER AND DATA SYSTEMS | 1.7428579 | 2016-11-22T01:54:06Z |
| CNC UNIVERSAL | 1.6896107 | 2016-12-01T01:04:48Z |
| CNC Universal | 1.6895752 | 2016-11-29T18:52:13Z |
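As a rough sketch of how the bf expression can be passed to the edismax parser outside the admin console, the snippet below only builds the request URL and does not contact a server. The host, the core name (companies), the name field and the search term are assumptions made for illustration; only the bf expression and the timestamp field come from the example above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host and core name are assumptions, not from the article.
SOLR_SELECT = "http://localhost:8983/solr/companies/select"

# The boost function exactly as used in the example above.
BF = "recip(ms(NOW/HOUR,timestamp),3.16e-11,0.08,0.05)"


def build_boosted_query(terms: str) -> str:
    """Return a select URL that applies the additive date boost."""
    params = {
        "defType": "edismax",   # bf is an edismax (and dismax) parameter
        "q": terms,
        "qf": "name",           # assumed name of the company name field
        "bf": BF,
        "fl": "name,score,timestamp",
        "rows": 10,
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_boosted_query("universal")  # assumed search term
print(url)
```

urlencode takes care of escaping the parentheses, commas and slash in the function, which is easy to get wrong when typing the URL by hand.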
For reference, the function:
recip(x, m, a, b) implements f(x) = a/(xm+b)
Where:
x : the document age in ms, defined as ms(NOW,<datefield>). In our example NOW is rounded down to the hour (NOW/HOUR).
m : a constant that defines the time scale over which the boost is applied. It should be relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) implies using the inverse: 3.16e-11 (1/3.16e10 rounded). For a news site, a document that is 7 days old could be considered stale.
a and b : constants (defined arbitrarily).
xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier doesn't exceed 1 for recent documents.
With a = b = 1, a document 1 reference_time old has a multiplier of about 1/2, a document 2 reference_times old has a multiplier of about 1/3, and so on.
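The behaviour described above can be checked with a small plain-Python model of the formula. This is only a sketch of f(x) = a/(xm+b), not Solr's own implementation; the names recip and MS_PER_YEAR are chosen here for illustration.

```python
MS_PER_YEAR = 3.16e10  # one year in milliseconds, rounded as in the text


def recip(x: float, m: float, a: float, b: float) -> float:
    """Plain-Python model of recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


m = 3.16e-11  # inverse of a 1 year reference_time

# With a = b = 1 the multiplier starts at 1 and falls to roughly
# 1/2, 1/3, 1/4 after one, two and three reference_times.
for years in (0, 1, 2, 3):
    age_ms = years * MS_PER_YEAR
    print(years, round(recip(age_ms, m, 1.0, 1.0), 4))

# With the constants from the bf example (a=0.08, b=0.05) a brand new
# document gets a/b, which is larger than 1.
print(recip(0, m, 0.08, 0.05))
```

Because m is a rounded inverse, the values at whole reference_times are close to, but not exactly, 1/2 and 1/3.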
How to make date boosting stronger?
Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1 year reference, the multiplier decreases 2x faster as the document age increases.
Decrease a and b: this expands the response curve of the function. This can be very aggressive.
See the graph in this reference: http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr
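To get a feel for the two options, the sketch below compares how much of the maximum boost (a/b) is left after 7, 30 and 180 days. The choice a = b = 0.1 for the "smaller constants" case is an arbitrary value picked for illustration, and drop_after is a helper defined only for this example.

```python
MS_PER_DAY = 86_400_000
M_ONE_YEAR = 3.16e-11    # reference_time of 1 year
M_SIX_MONTHS = 6.33e-11  # reference_time of 6 months


def recip(x: float, m: float, a: float, b: float) -> float:
    """Plain-Python model of recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


def drop_after(days: float, m: float, a: float, b: float) -> float:
    """Fraction of the maximum boost (a/b) that remains after `days`."""
    return recip(days * MS_PER_DAY, m, a, b) / (a / b)


cases = [
    ("1 year,   a=b=1  ", M_ONE_YEAR, 1.0, 1.0),
    ("6 months, a=b=1  ", M_SIX_MONTHS, 1.0, 1.0),
    ("1 year,   a=b=0.1", M_ONE_YEAR, 0.1, 0.1),  # arbitrary smaller constants
]
for label, m, a, b in cases:
    print(label, [round(drop_after(d, m, a, b), 3) for d in (7, 30, 180)])
```

In this model, shrinking a and b together makes the boost fall away much faster than halving the reference_time does, which matches the warning that it can be very aggressive.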
bf scoring
“bf” is an additive boost. It is independent of the global score (relevancy). That means that its relative weight on relevant results (high scores) is less than on poorer matches (low scores).
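A quick way to see this is to add the same boost to a high and a low relevancy score and compare the relative change. The two base scores are taken from the unboosted results table and the boost value from the explain output further down; pairing them like this is a simplification for illustration, since in practice each document gets its own boost value.

```python
def relative_gain(base_score: float, additive_boost: float) -> float:
    """Share of the base score that an additive boost represents."""
    return additive_boost / base_score


BOOST = 0.0015685626  # additive bf contribution of a recent document

high = relative_gain(2.3220387, BOOST)  # strong match (UNIVERSAL TOOL INC)
low = relative_gain(1.6880591, BOOST)   # weaker match (CNC UNIVERSAL)

print(f"high score: +{high:.4%}")
print(f"low score:  +{low:.4%}")
```

The same absolute boost is a larger fraction of the weaker score, so freshness reorders poor matches more readily than strong ones.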
This is the scoring without timestamp boosting. The overall score is the product of the queryWeight and the fieldWeight:
2.3220387 = score(doc=1704,freq=1.0), product of:
  0.4467017 = queryWeight, product of:
    15.0 = boost
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.0028644716 = queryNorm
  5.1981864 = fieldWeight in 1704, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.5 = fieldNorm(doc=1704)
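The arithmetic in this explain output can be replayed by hand. The sketch below multiplies the reported factors together; it uses Python doubles rather than Solr's 32-bit floats, so the last digits may differ slightly.

```python
import math

# Factors copied from the explain output above.
boost = 15.0
idf = 1.5321448 + 8.864228   # sum of the two term idf values
query_norm = 0.0028644716
tf = 1.0
field_norm = 0.5

query_weight = boost * idf * query_norm   # reported as 0.4467017
field_weight = tf * idf * field_norm      # reported as 5.1981864
score = query_weight * field_weight       # reported as 2.3220387

print(query_weight, field_weight, score)
```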
This example uses timestamp boosts. Note that the queryNorm value changes. This is an attempt by Solr to normalize the scores so that the two different queries can be compared. It is not generally considered to be accurate. The overall score is: (0.44669986 * 5.1981864) + 0.0015685626
2.323597699 = score(doc=7834,freq=1.0), product of:
  0.44669986 = queryWeight, product of:
    15.0 = boost
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.0028644598 = queryNorm
  5.1981864 = fieldWeight in 7834, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = phraseFreq=1.0
    10.396373 = idf(), sum of:
      1.5321448 = idf(docFreq=24456, maxDocs=41640)
      8.864228 = idf(docFreq=15, maxDocs=41640)
    0.5 = fieldNorm(doc=7834)
0.0015685626 = product of:
  0.54759455 = 0.08/(3.16E-11*float(ms(const(1483624800000),date(timestamp)=2016-12-01T09:17:47Z))+0.05)
  1.0 = boost
  0.0028644598 = queryNorm
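The boosted score can be reconstructed from the numbers in this explain output, including the date arithmetic inside the function. The sketch below is a plain-Python check using doubles, so it agrees with Solr's figures only to about six decimal places; the variable names are chosen for illustration.

```python
from datetime import datetime, timezone


def recip(x: float, m: float, a: float, b: float) -> float:
    """Plain-Python model of recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


# NOW/HOUR as reported in the explain output (epoch milliseconds).
now_ms = 1483624800000
# Timestamp of the document being scored.
doc_time = datetime(2016, 12, 1, 9, 17, 47, tzinfo=timezone.utc)
doc_ms = int(doc_time.timestamp() * 1000)
age_ms = now_ms - doc_ms

func = recip(age_ms, 3.16e-11, 0.08, 0.05)  # reported as 0.54759455
bf_part = func * 1.0 * 0.0028644598         # function * boost * queryNorm
text_part = 0.44669986 * 5.1981864          # queryWeight * fieldWeight
total = text_part + bf_part                 # reported as 2.323597699

print(age_ms, func, bf_part, total)
```

The document is about 35 days old at query time, so with a 1 year reference_time the function value is still a large fraction of its maximum a/b = 1.6, yet after multiplication by queryNorm it adds only about 0.0016 to the score.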