All you need to know about Elasticsearch 5.0: Search
Elasticsearch 5.0 was released last week, as part of a wider release of the Elastic Stack which lines up the version numbers of all the stack products. Kibana, Logstash, Beats and Elasticsearch are all at version 5.0 now. This release is quite a large one, and includes thousands of change items. I personally find this release exciting.
It's quite easy to get lost in the details due to the sheer number of changes in this release. In this post I will summarize the items I see as important, with some of my own commentary and advice. Hopefully it will shed some light on where Elastic is standing and where they are headed.
This first post focuses on search-related topics. Future posts will focus on indices and cluster management, data ingestion capabilities, new debugging tools, ad-hoc batch processing, and more.
Full-text search
One fundamental feature of Elasticsearch is scoring - or ranking results by relevance. The part that handles it is a Lucene component called Similarity. ES 5.0 now makes Okapi BM25 the default similarity, and that's quite an important change. The default has long been tf/idf, which is simpler to understand but easier to fool with rogue results. BM25 is a probabilistic approach to ranking that almost always gives better results than the more vanilla tf/idf. I've been recommending that customers use BM25 over tf/idf for a long time now, and we also rely on it at Forter for doing quite a lot of interesting stuff. Overall, a good move by ES, and I can finally retire a piece of advice I have been giving for years. Britta Weber has a great talk explaining the difference, and BM25 in particular; definitely a recommended watch.
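To make the difference concrete, here is a minimal sketch of how a single term contributes to a score under each model. It assumes one common formulation of BM25 with the usual defaults (k1 = 1.2, b = 0.75); the tf/idf function is a simplified stand-in rather than Lucene's exact formula, and all function and parameter names are mine.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one term in one document under BM25 (sketch)."""
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalized, controlled by b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # The tf part saturates: it approaches (k1 + 1) as tf grows.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


def simple_tfidf_term_score(tf, n_docs, doc_freq):
    """Simplified tf/idf: the tf part keeps growing with the term count."""
    idf = 1 + math.log(n_docs / (doc_freq + 1))
    return math.sqrt(tf) * idf


if __name__ == "__main__":
    # Hypothetical corpus statistics, for illustration only.
    for tf in (1, 10, 100, 1000):
        bm25 = bm25_term_score(tf, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
        tfidf = simple_tfidf_term_score(tf, n_docs=1000, doc_freq=10)
        print(f"tf={tf:5d}  bm25={bm25:.3f}  tfidf={tfidf:.3f}")
```

The point to notice is saturation: repeating a term a thousand times barely moves the BM25 score past what a handful of occurrences gives, while the tf/idf-style score keeps climbing. That is the sense in which tf/idf is easier to fool.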
Another good change is simplifying access to analyzed/not-analyzed fields. Often you need to avoid tokenizing string fields because you want to be able to look for them as-is, or need to use them in aggregations or to sort by them - even if they include spaces or weird characters. Instead of both being called "string fields", they are now text (analyzed) and keyword (not-analyzed). This should improve readability of mappings and accessibility of that feature. The only remaining item in my opinion is the not-tokenized-but-lowercased case - it is common enough but will still require some rigorous configuration. It probably makes sense now to allow specifying "token-filters" to execute on "keyword" fields directly in that field's mapping; luckily, it seems work on that is already underway.
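As an illustration, a mapping that indexes the same value both ways could look roughly like the request body below, built here as a Python dict. The type name "product", the field name "title" and the sub-field name "raw" are hypothetical; only the text and keyword types come from the release.

```python
import json

# Request body for creating an index (sketch). The same string is indexed
# twice: analyzed for full-text search, and as-is for sorting/aggregations.
create_index_body = {
    "mappings": {
        "product": {  # hypothetical mapping type name
            "properties": {
                "title": {
                    "type": "text",  # analyzed: tokenized for full-text search
                    "fields": {
                        # not analyzed: exact match, sorting, aggregations
                        "raw": {"type": "keyword"}
                    },
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(create_index_body, indent=2))
```

With a mapping like this you would search on title and sort or aggregate on title.raw.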
While on this topic, one piece of advice - if you need to lowercase keyword-type fields, you probably want to also asciifold them.
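One way to get the not-tokenized-but-lowercased behavior today is a custom analyzer that combines the keyword tokenizer with the lowercase and asciifolding token filters. The sketch below shows such a request body as a Python dict; the analyzer, type and field names are hypothetical. The small helper after it is only a rough Python approximation of what the two filters do to a value, not the Lucene implementation.

```python
import unicodedata

# Request body sketch: the whole value stays one token (keyword tokenizer),
# which is then lowercased and folded to ASCII.
create_index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "lowercase_folded_keyword": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "asciifolding"],
                }
            }
        }
    },
    "mappings": {
        "product": {  # hypothetical mapping type name
            "properties": {
                "brand": {"type": "text", "analyzer": "lowercase_folded_keyword"}
            }
        }
    },
}


def lowercase_asciifold(value):
    """Rough approximation of lowercase + asciifolding for accented Latin text."""
    decomposed = unicodedata.normalize("NFKD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.lower()


if __name__ == "__main__":
    print(lowercase_asciifold("Café Münster"))
```

Folding matters because users rarely type accents consistently; without it, "Café" and "cafe" end up as two different terms.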
Better search due to low-level indexing tweaks
Historically, Elasticsearch is a text-search engine. When search on numeric values and ranges was added, it still used string-matching under the hood, translating the numbers into terms that could also be searched by range. The same goes for geo-spatial search - Elasticsearch (or rather, the underlying Lucene engine) required translating whatever value you had into a string to make it searchable.
Starting in ES 5.0 every index now also has a k-d tree, and that data structure is where search is performed for all non-string fields, instead of the string-based inverted index. This means numbers, geo-spatial points and shapes, and now even IPv6 (IPv4 was already supported before) are indexed natively, and searches on them - including ranges - are multiple times faster than before.
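For readers who have not met k-d trees before, here is a toy in-memory version with a range search. It is only meant to show why the structure suits range queries on multi-dimensional values; Lucene's implementation is a disk-friendly block variant and is considerably more involved. All names are mine.

```python
class Node:
    def __init__(self, point, axis, left, right):
        self.point = point
        self.axis = axis    # dimension this node splits on
        self.left = left    # points with a smaller-or-equal value on that axis
        self.right = right  # points with a greater-or-equal value on that axis


def build(points, depth=0):
    """Build a k-d tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return Node(
        ordered[mid],
        axis,
        build(ordered[:mid], depth + 1),
        build(ordered[mid + 1:], depth + 1),
    )


def range_search(node, lo, hi, found=None):
    """Collect all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if found is None:
        found = []
    if node is None:
        return found
    if all(l <= v <= h for v, l, h in zip(node.point, lo, hi)):
        found.append(node.point)
    axis = node.axis
    # Only descend into a subtree if the query box can overlap it.
    if lo[axis] <= node.point[axis]:
        range_search(node.left, lo, hi, found)
    if node.point[axis] <= hi[axis]:
        range_search(node.right, lo, hi, found)
    return found


if __name__ == "__main__":
    tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
    print(sorted(range_search(tree, lo=(4, 1), hi=(8, 5))))
```

The pruning in range_search is the whole trick: entire subtrees are skipped when the query box cannot overlap them, so a range query touches only a fraction of the indexed values.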
You should expect to see more sophisticated geo-spatial queries, aggregations and other operations, also thanks to Lucene's LatLonPoint, which highly optimizes memory and disk footprints as well as search and indexing speeds: WKT support, searches on 2D shapes, 3D and even 4D+ shape search, adding dimensions from other sources (geo-spatial + some other metric collected from some datasource, for example), interesting applications of nearest-neighbor searches, and more. The underlying libraries support many of them already, and I've been hearing quite a lot of requests for such capabilities. With this significant performance boost I reckon they will finally be exposed.
Lastly, since every value type which can be encoded as an ordered byte[] of fixed length can be searchable via k-d trees, we will probably start seeing some new types of data being indexed into Elasticsearch.
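To illustrate what "an ordered byte[] of fixed length" means in practice, the sketch below encodes signed 32-bit integers so that comparing the raw bytes gives the same order as comparing the numbers. The trick is to shift the range so that negative values sort first, then write the result big-endian. The function name is mine, and this illustrates the idea rather than reproducing Lucene's code. An IPv6 address is a second example: its 16-byte packed form already compares in address order.

```python
import ipaddress
import struct


def int_to_sortable_bytes(value):
    """Encode a signed 32-bit int as 4 bytes whose byte order matches numeric order."""
    if not -2**31 <= value < 2**31:
        raise ValueError("value does not fit in a signed 32-bit integer")
    # Shifting by 2**31 is equivalent to flipping the sign bit:
    # the smallest negative number becomes 0, the largest positive 2**32 - 1.
    return struct.pack(">I", value + 2**31)


if __name__ == "__main__":
    values = [42, -5, 0, 2**31 - 1, -2**31, 3]
    by_bytes = sorted(values, key=int_to_sortable_bytes)
    print(by_bytes == sorted(values))

    # IPv6 addresses are naturally fixed-length (16 bytes) and ordered.
    a = ipaddress.IPv6Address("2001:db8::1").packed
    b = ipaddress.IPv6Address("2001:db8::2").packed
    print(len(a), a < b)
```

Any type that can be given such an encoding (dates, decimals with a fixed scale, and so on) is a candidate for the same index structure.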
Read-your-write support
Anyone who ever wrote a CRUD-type application on top of an eventually-consistent database is familiar with the common gotcha of posting a form and then not seeing the new piece of data in the listing page, being confused for a moment, and then refreshing the page a second later and seeing it. This is annoying in back-end applications used internally, but can be a terrible user experience if experienced by your end users.
Elasticsearch indexes are eventually-consistent. The search is officially defined as "near-real-time", or in other words - don't expect to immediately see the document you just added in search results. It can appear within one second (the index refresh rate), or a bit longer if you happen to query a replica.
Until now there wasn't a good way to know when to display the listing page after a successful form post. Adding a synthetic wait is just not deterministic enough and to be frank is quite a code smell, and forcing a refresh on write isn't recommended for many reasons.
ES5 adds the ability to wait for a refresh on a write request. If you specify ?refresh=wait_for on any index, update, or delete request, the request will block until a refresh has happened and the change is visible to search. If too many requests are queued up, it will force a refresh to clear out the queue. The refresh is awaited cluster wide - primaries and replicas.
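In practice this is just a query-string parameter on the write request. Below is a minimal sketch using only the Python standard library; it builds the request without sending it. The host, index name, type name and document are made up for illustration, and the helper function is mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build a PUT request that indexes a document and waits for the next refresh."""
    query = urlencode({"refresh": "wait_for"})
    url = f"{base_url}/{index}/{doc_type}/{doc_id}?{query}"
    return Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical cluster address, index, type and document.
    request = build_index_request(
        "http://localhost:9200", "blog", "post", 1, {"title": "Hello ES 5.0"}
    )
    print(request.get_method(), request.full_url)
    # Against a live cluster you would send it with urllib.request.urlopen(request);
    # the call returns only once the document is visible to search.
```

Once the response comes back it is safe to redirect the user to the listing page. The trade-off is that the write request itself takes longer, up to the refresh interval.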
Next up
I will be publishing more posts about Elastic 5.0, focusing on more interesting capabilities like new debugging enablers, batch processing support, index management improvements, data ingestion architectures and more. Stay tuned!
Furthermore, check out my Elasticsearch courses - currently running in London and Israel via BigData Boutique, for developers and operations.