Adrien Grand, a software engineer at Elasticsearch, explains the internals of Lucene indexes. Worth a look if you are working with Solr too. The SlideShare notes are here: http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
Some takeaways from Adrien's lecture:
- The point of indexing is making data fast to search.
- Don’t make analysis chains too complex.
- Duplicate fields as and when necessary.
When to commit?
When Lucene updates an index, it creates a new segment instead of modifying an existing one, which saves updating the term dictionary (an expensive operation). However, writing a new segment is itself expensive, so it is better to group commits to disk. The Lucene term dictionary is filesystem-cache friendly: it never changes during its lifetime. This also makes it safe for multi-threaded access, so there is no need for locks.
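To see why grouping commits helps, here is a toy model in Python. It is not Lucene's API: `ToyIndex`, its methods, and the batch size of 100 are made up for illustration. It only shows that if every commit writes one new immutable segment, committing per document produces many more segments than committing per batch.

```python
class ToyIndex:
    """Toy model: each commit writes the buffered documents as one new segment."""

    def __init__(self):
        self.buffer = []    # documents added since the last commit
        self.segments = []  # each segment is an immutable tuple of documents

    def add(self, doc):
        self.buffer.append(doc)

    def commit(self):
        # Existing segments are never modified; a commit only appends a new one.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []


docs = [f"doc-{i}" for i in range(1000)]

# Commit after every document: one segment per document.
eager = ToyIndex()
for doc in docs:
    eager.add(doc)
    eager.commit()

# Commit once per batch of 100 documents (hypothetical batch size).
batched = ToyIndex()
for i, doc in enumerate(docs, start=1):
    batched.add(doc)
    if i % 100 == 0:
        batched.commit()

print(len(eager.segments), len(batched.segments))  # prints: 1000 10
```

Fewer segments means fewer expensive writes and fewer segments to visit at search time.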
In Solr, when you add a document, you can give a commit window (the commitWithin parameter). All documents added within this window are then committed together. Also remember that the filter cache gets invalidated on commit.
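As a rough sketch of what a commit window looks like in practice, the snippet below builds (but does not send) an update request that carries Solr's `commitWithin` parameter, given in milliseconds. The host, the collection name `articles`, and the document fields are assumptions for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url, collection, docs, commit_within_ms):
    """Build a JSON update request asking Solr to commit within the given window."""
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url}/{collection}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, collection and document; nothing is sent over the network here.
req = build_update_request(
    "http://localhost:8983/solr",
    "articles",
    [{"id": "1", "title": "What is in a Lucene index"}],
    commit_within_ms=10000,
)
print(req.full_url)
# prints: http://localhost:8983/solr/articles/update?commitWithin=10000
```

Passing the request to `urllib.request.urlopen` would send it to a running Solr instance. Documents added this way become searchable within the window without each add forcing its own commit.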
Segments
When searching, Lucene needs to concatenate results from all segments. Lucene merges segments periodically. Merging is not expensive. Deletion is done by setting a bit to zero (in a bitset, like a filter query), so no space is recovered at that point. When segments are merged, deleted documents are not copied into the merged segment. The old segments can then be deleted, freeing up space.
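The behaviour described above can be sketched with a toy model in Python. This is not how Lucene is implemented internally: the `Segment` class, the substring matching, and the sample documents are all simplifications invented for illustration. It shows search across segments, deletion as a cleared bit, and a merge that drops deleted documents.

```python
class Segment:
    """Toy segment: a fixed tuple of documents plus a 'live' bitset for deletions."""

    def __init__(self, docs):
        self.docs = tuple(docs)
        self.live = [1] * len(self.docs)  # 1 = live, 0 = deleted

    def delete(self, doc):
        # Deleting only clears a bit; the document still occupies space.
        for i, d in enumerate(self.docs):
            if d == doc:
                self.live[i] = 0

    def search(self, term):
        # The bitset filters out deleted documents at search time.
        return [d for d, bit in zip(self.docs, self.live) if bit and term in d]


def search_all(segments, term):
    # A search visits every segment and concatenates the per-segment hits.
    hits = []
    for seg in segments:
        hits.extend(seg.search(term))
    return hits


def merge(segments):
    # Only live documents are copied into the merged segment.
    survivors = [d for seg in segments for d, bit in zip(seg.docs, seg.live) if bit]
    return Segment(survivors)


segments = [
    Segment(["lucene index", "solr cache"]),
    Segment(["lucene merge", "solr commit"]),
]
segments[0].delete("lucene index")

stored_before = sum(len(s.docs) for s in segments)  # deleted doc still stored
hits = search_all(segments, "lucene")
merged = merge(segments)
print(stored_before, hits, len(merged.docs))
# prints: 4 ['lucene merge'] 3
```

Space is only reclaimed once the merged segment replaces the old ones.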