Thursday, January 5, 2017

Document Age Scoring

Document age scoring takes the freshness of a document into account when ranking results. As a very general rule, the newer the document, the more relevant it is. Consider a news site: recent stories, say from the last day or so, are of much more interest for a generic headline search. Archives are still useful to advanced searchers doing research.

The example below uses a set of documents relating to company reports. The search is on the company Name field.

No timestamp boost

These results have no timestamp boosting applied. When documents have identical scores (for example Universal Tool Inc), the document id is used as a tiebreaker: lower document ids rank higher. This means that older documents are shown first.

Name                                Score      Timestamp
UNIVERSAL FISHER LLC                2.3220387  2015-12-02T02:18:24Z
UNIVERSAL TOOL INC                  2.3220387  2016-11-05T05:06:16Z
UNIVERSAL TOOL INC                  2.3220387  2016-11-05T05:06:22Z
UNIVERSAL TOOL INC.                 2.3220387  2016-12-01T09:17:47Z
UNIVERSAL TOOL INC.                 2.3220387  2016-11-30T02:24:15Z
UNIVERSAL MICRO BUSINESS SOLUTIONS  2.0317838  2016-11-22T02:13:13Z
UNIVERSAL WEATHER AND DATA SYSTEMS  1.741529   2016-11-22T01:54:06Z
UNIVERSAL WEATHER AND DATA SYSTEMS  1.741529   2016-11-28T22:31:49Z
CNC UNIVERSAL                       1.6880591  2016-10-06T17:24:22Z
CNC UNIVERSAL                       1.6880591  2016-11-11T17:09:15Z

Timestamp boost with reciprocal decay

The recip function is frequently used to account for the freshness of documents. It can be tested in the Solr admin console by setting the “bf” parameter with the edismax query parser. Newer documents should then score higher than older documents.

bf=recip(ms(NOW/HOUR,timestamp),3.16e-11,0.08,0.05)
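
The same request can also be built outside the admin console. The snippet below is only a sketch: the host, port, core name (companies), the qf field (name) and the query term are assumptions for illustration. Only the bf expression is taken from the example above.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and core name; adjust to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/companies/select"


def build_boosted_query_url(user_query):
    """Build an edismax query URL that adds the recip freshness boost via bf."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "name",                   # assumed name of the searched field
        "fl": "name,score,timestamp",   # return the columns shown in the tables
        "bf": "recip(ms(NOW/HOUR,timestamp),3.16e-11,0.08,0.05)",
    }
    # urlencode percent-escapes the parentheses, commas and slash in bf.
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_boosted_query_url("universal")
print(url)
```

Adding debugQuery=true to the parameters returns the score explanation used later in this post.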

This is sometimes called a “fuzzy ordering”. It is just one of a number of factors taken into account for document ranking, as can be seen in the results below.

Here are the complete results. For otherwise identical matches, ordering is now by date, with newer documents first. The result Universal Micro Business Solutions scores lower despite being a recent document because its field length is longer. Newer CNC Universal documents have also made their way into the top ten.

Name                                Score      Timestamp
UNIVERSAL TOOL INC                  2.3235977  2016-12-01T09:17:47Z
UNIVERSAL TOOL INC                  2.3235607  2016-11-30T02:24:15Z
UNIVERSAL TOOL INC                  2.3230824  2016-11-05T05:06:16Z
UNIVERSAL TOOL INC                  2.3230824  2016-11-05T05:06:22Z
UNIVERSAL FISHER LLC                2.3222296  2015-12-02T02:18:24Z
UNIVERSAL MICRO BUSINESS SOLUTIONS  2.0331118  2016-11-22T02:13:13Z
UNIVERSAL WEATHER AND DATA SYSTEMS  1.7430217  2016-11-28T22:31:49Z
UNIVERSAL WEATHER AND DATA SYSTEMS  1.7428579  2016-11-22T01:54:06Z
CNC UNIVERSAL                       1.6896107  2016-12-01T01:04:48Z
CNC Universal                       1.6895752  2016-11-29T18:52:13Z

For reference, the function:

recip(x, m, a, b) implements f(x) = a/(xm+b)

Where:

x: the document age in ms, defined as ms(NOW,<datefield>). In our example NOW is rounded down to the start of the current hour (NOW/HOUR).

m: a constant that defines the time scale over which the boost is applied. It should be the inverse of what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) means using the inverse: 3.16e-11 (1/3.16e10, rounded). For a news site, a document that is 7 days old could already be considered stale, so a much shorter reference_time would apply.

a and b are constants (defined arbitrarily).

xm = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
xm ≈ 0 when the document is new, resulting in a value close to a/b.

Using the same value for a and b ensures the multiplier doesn't exceed 1 with recent documents.

With a = b = 1, a 1 reference_time old document has a multiplier of about 1/2, a 2 reference_time old document has a multiplier of about 1/3, and so on.
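
These values are easy to check by hand or with a few lines of Python. The recip helper below is a hand-written stand-in for the Solr function, not Solr code, and the 1 year reference_time is the rounded 3.16e10 ms figure used above.

```python
def recip(x, m, a, b):
    """Stand-in for Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


MS_PER_YEAR = 3.16e10      # reference_time of 1 year, rounded as in the text
m = 1 / MS_PER_YEAR        # about 3.16e-11

# With a = b = 1 the multiplier goes roughly 1, 1/2, 1/3, 1/4 ...
for age_in_reference_times in (0, 1, 2, 3):
    x = age_in_reference_times * MS_PER_YEAR   # document age in ms
    print(age_in_reference_times, round(recip(x, m, a=1, b=1), 4))

# With the a and b from the bf example, a brand new document gets a/b.
print(round(recip(0, m, a=0.08, b=0.05), 4))
```

Note that with a = 0.08 and b = 0.05, as in the bf example, the value for a brand new document is a/b = 1.6, so it can exceed 1.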

How to make date boosting stronger?

Increase m: choose a shorter reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1 year reference, the multiplier decreases twice as fast as the document age increases.
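
As a rough illustration of the effect of m, the sketch below derives m from a few reference times and compares the multiplier for a six month old document. The helper names are made up for this example, and a = b = 1 is assumed to keep the numbers simple.

```python
MS_PER_YEAR = 3.16e10                    # rounded value used in the text
MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000    # 604,800,000 ms


def recip(x, m, a, b):
    """Stand-in for Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


def m_for(reference_time_ms):
    """m is simply the inverse of the chosen reference_time in ms."""
    return 1.0 / reference_time_ms


m_year = m_for(MS_PER_YEAR)            # about 3.16e-11
m_half_year = m_for(MS_PER_YEAR / 2)   # about 6.33e-11
m_week = m_for(MS_PER_WEEK)            # about 1.65e-9, e.g. for a news site

# Multiplier for a document that is six months old, with a = b = 1.
six_months_ms = MS_PER_YEAR / 2
for label, m in (("1 year", m_year), ("6 months", m_half_year)):
    print(label, round(recip(six_months_ms, m, a=1, b=1), 4))
```

With the 1 year reference the six month old document keeps about 2/3 of the maximum multiplier; with the 6 month reference it is already down to 1/2.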

Decreasing a and b expands the response curve of the function. This can be very aggressive. See this graph:


ref: http://www.slideshare.net/lucenerevolution/potter-timothy-boosting-documents-in-solr

bf scoring

“bf” is an additive boost. It is independent of the global score (relevancy). That means that its relative weight on relevant results (high scores) is less than on poorer matches (low scores).
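
A small calculation makes this concrete. The two base scores and the boost value are taken from the results and explain output in this post; applying the same boost to both documents is an assumption made purely to isolate the effect of the base score.

```python
def boost_share(base_score, additive_boost):
    """Fraction of the final score that comes from the additive boost."""
    return additive_boost / (base_score + additive_boost)


boost = 0.0015685626        # bf contribution seen in the explain output
high_score = 2.3220387      # a strong match (Universal Tool Inc)
low_score = 1.6880591       # a weaker match (CNC Universal)

share_high = boost_share(high_score, boost)
share_low = boost_share(low_score, boost)

# The same boost is a larger fraction of the weaker match's final score.
print(f"high: {share_high:.4%}  low: {share_low:.4%}")
```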

This is the scoring without timestamp boosting. The overall score is the product of the queryWeight and the fieldWeight:

2.3220387 = score(doc=1704,freq=1.0), product of:
      0.4467017 = queryWeight, product of:
        15.0 = boost
        10.396373 = idf(), sum of:
          1.5321448 = idf(docFreq=24456, maxDocs=41640)
          8.864228 = idf(docFreq=15, maxDocs=41640)
        0.0028644716 = queryNorm
      5.1981864 = fieldWeight in 1704, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = phraseFreq=1.0
        10.396373 = idf(), sum of:
          1.5321448 = idf(docFreq=24456, maxDocs=41640)
          8.864228 = idf(docFreq=15, maxDocs=41640)
        0.5 = fieldNorm(doc=1704)
              
This example uses timestamp boosts. Note that the queryNorm value changes (from 0.0028644716 to 0.0028644598). queryNorm is an attempt by Solr to normalize scores so that different queries can be compared, but it is not generally considered to be accurate. The overall score is the queryWeight times the fieldWeight, plus the boost term: (0.44669986 * 5.1981864) + 0.0015685626 = 2.3235977
                
2.323597699 = score(doc=7834,freq=1.0), product of:
        0.44669986 = queryWeight, product of:
          15.0 = boost
          10.396373 = idf(), sum of:
            1.5321448 = idf(docFreq=24456, maxDocs=41640)
            8.864228 = idf(docFreq=15, maxDocs=41640)
          0.0028644598 = queryNorm
        5.1981864 = fieldWeight in 7834, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = phraseFreq=1.0
          10.396373 = idf(), sum of:
            1.5321448 = idf(docFreq=24456, maxDocs=41640)
            8.864228 = idf(docFreq=15, maxDocs=41640)
          0.5 = fieldNorm(doc=7834)
  0.0015685626 = product of:
    0.54759455 = 0.08/(3.16E-11*float(ms(const(1483624800000),date(timestamp)=2016-12-01T09:17:47Z))+0.05)
    1.0 = boost
    0.0028644598 = queryNorm
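
The arithmetic in this explain output can be re-checked outside Solr. The sketch below is not Solr code: the recip helper is a hand-written stand-in and all constants are copied from the output above. Solr computes in single precision, so the last digits may differ slightly.

```python
from datetime import datetime, timezone


def recip(x, m, a, b):
    """Stand-in for Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)


# Values copied from the explain output.
now_ms = 1483624800000     # the NOW/HOUR constant, in ms since the epoch
doc_time = datetime(2016, 12, 1, 9, 17, 47, tzinfo=timezone.utc)
query_norm = 0.0028644598
query_weight = 0.44669986
field_weight = 5.1981864

# Document age in ms, i.e. ms(NOW/HOUR, timestamp).
age_ms = now_ms - int(doc_time.timestamp() * 1000)

freshness = recip(age_ms, 3.16e-11, 0.08, 0.05)   # about 0.5476
bf_part = freshness * 1.0 * query_norm            # boost of 1.0 times queryNorm
total = query_weight * field_weight + bf_part     # about 2.3236

print(age_ms, freshness, bf_part, total)
```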
