Tuesday, January 3, 2017

What is in a Lucene index?

Adrien Grand, Software Engineer at Elasticsearch, explains the internals of Lucene indexes. It is also useful if you are working with Solr. The SlideShare notes are here: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal

Some takeaways from Adrien's lecture

  • The point of indexing is making data fast to search.
  • Don’t make analysis chains too complex.
  • Duplicate fields as and when necessary.

When to commit?

When Lucene updates an index, it writes a new segment rather than modifying existing ones, which avoids updating the term dictionary in place (an expensive operation). Committing to disk is itself expensive, however, so it is better to group changes into fewer commits. A segment's term dictionary is filesystem-cache friendly because it never changes during its lifetime. This immutability has another benefit: the data is safe to read from multiple threads, so there is no need for locks.
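To make the cost of frequent commits concrete, here is a minimal sketch in Python. It is a toy model, not Lucene's API: the ToyWriter class and its add and commit methods are hypothetical names invented for illustration, and a "segment" is just a frozen tuple standing in for an immutable term dictionary.

```python
from collections import defaultdict


class ToyWriter:
    """Hypothetical, heavily simplified model of an index writer (not Lucene's API)."""

    def __init__(self):
        self.buffer = []      # documents added since the last commit
        self.segments = []    # each committed segment is an immutable term dictionary

    def add(self, text):
        self.buffer.append(text)

    def commit(self):
        """Write buffered documents as one new segment; existing segments are untouched."""
        if not self.buffer:
            return
        postings = defaultdict(list)
        for doc_id, text in enumerate(self.buffer):
            for term in set(text.lower().split()):
                postings[term].append(doc_id)
        # Freeze the term dictionary: tuples cannot be modified after creation.
        segment = tuple((term, tuple(ids)) for term, ids in sorted(postings.items()))
        self.segments.append(segment)
        self.buffer = []


docs = ["fast search", "search index", "index segment", "segment merge"]

eager = ToyWriter()
for d in docs:
    eager.add(d)
    eager.commit()          # one commit per document

batched = ToyWriter()
for d in docs:
    batched.add(d)
batched.commit()            # one commit for the whole batch

print(len(eager.segments), len(batched.segments))
```

Committing after every document leaves four tiny segments to search and later merge, while the batched writer produces a single segment for the same data.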

In Solr, when you add a document you can specify a commit window (the commitWithin parameter). Documents added within this window are committed together rather than one at a time. Also remember that the filter cache is invalidated on commit.
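As a rough illustration of the commit window, the sketch below builds the JSON body of a Solr add command carrying a commitWithin value in milliseconds. The helper name build_add_command and the sample document are made up for this example, and the body is assumed to be sent as application/json to the update handler of whichever core or collection you use.

```python
import json


def build_add_command(doc, commit_within_ms):
    """Return the JSON body for a Solr update request that adds one document.

    commit_within_ms asks Solr to commit the document within that many
    milliseconds, so adds arriving close together can share a commit.
    """
    return json.dumps({"add": {"doc": doc, "commitWithin": commit_within_ms}})


# Hypothetical document; the field names depend on your schema.
body = build_add_command({"id": "1", "title": "What is in a Lucene index?"}, 10000)
print(body)
```

Sending many such adds with the same window lets Solr pick the commit point, instead of the client forcing a commit after each document.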

Segments

When searching, Lucene needs to concatenate the results from all segments. Lucene merges segments periodically. Merging is not expensive. A deletion is recorded by setting the document's bit to zero in a bitset (much like a filter query), so no space is recovered at that point. During a merge, deleted documents are not copied into the new segment. The old segments can then be deleted, freeing up space.
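The deletion and merge behaviour can be sketched with another toy model in Python. Again, none of this is Lucene's API: Segment, search_all and merge are hypothetical names, and a plain list of booleans stands in for the bitset.

```python
class Segment:
    """Hypothetical toy segment: an immutable document list plus a live-docs bitset."""

    def __init__(self, docs):
        self.docs = tuple(docs)                 # never modified after creation
        self.live = [True] * len(self.docs)     # one bit per document; False marks a deletion

    def delete(self, doc_id):
        self.live[doc_id] = False               # the document's data stays in place

    def search(self, term):
        return [d for d, alive in zip(self.docs, self.live)
                if alive and term in d.split()]


def search_all(segments, term):
    """Concatenate the hits from every segment."""
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits


def merge(segments):
    """Copy only live documents into a new segment; the inputs can then be discarded."""
    return Segment(d for seg in segments
                   for d, alive in zip(seg.docs, seg.live) if alive)


a = Segment(["lucene index", "solr index"])
b = Segment(["lucene segment"])
a.delete(1)                                     # mark "solr index" as deleted

stored_before = len(a.docs) + len(b.docs)       # the deleted document still occupies space
merged = merge([a, b])
print(search_all([a, b], "index"), stored_before, len(merged.docs))
```

The deleted document disappears from search results immediately, but its storage is only reclaimed once the merged segment replaces the old ones.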



