Hebrew search done right

Hebrew search, HebMorph, Lucene, ElasticSearch, IR



It's been a few years since I started experimenting with Hebrew search, and during that time I was able to create HebMorph and see it get adopted in many places for various uses. I was involved in several projects - open-source and commercial - that needed Hebrew search capabilities, and learned a lot about what it means to perform Hebrew searches properly in various contexts.

Perhaps the most interesting use of HebMorph is at Buzzilla, the company I was recently employed by.

The queries in Buzzilla are written very carefully, because to them every discussion counts. After a query is issued, statistical analysis is run on the number of discussions found, followed by further analysis on a random sample of discussions from the result set. Having irrelevant discussions in the results, or missing highly relevant ones, can have severe effects. Therefore, queries are usually lengthy and need to be very precise. While stemming (or rather, lemmatization) can and should be used, it needs to be applied selectively so it doesn't bloat the results with irrelevant discussions; that is, precision has to stay high. Doing this is challenging once you also take recall into account (the share of relevant documents that the query actually finds).
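To make the two measures concrete, here is a minimal sketch of how precision and recall could be computed for a query. The discussion ids and the two example queries are made up for illustration and are not Buzzilla data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved ids, given the truly relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the results is relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the relevant documents was found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical discussion ids, for illustration only.
relevant = {1, 2, 3, 4}
broad_query = {1, 2, 3, 4, 5, 6, 7, 8}  # e.g. every term lemmatized
narrow_query = {1, 2}                   # e.g. exact matches only

print(precision_recall(broad_query, relevant))   # (0.5, 1.0)
print(precision_recall(narrow_query, relevant))  # (1.0, 0.5)
```

The broad query finds everything but half of its results are noise; the narrow one is clean but misses half of the relevant discussions. Selective lemmatization is an attempt to get close to the best of both.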

This short guidance video (in Hebrew) demonstrates a very nice method with which we achieve exactly that. On one hand, lemmatization is used and thus high recall is achieved, because we are able to overcome most of Hebrew's challenges when it comes to full-text search. On the other hand, precision is kept high, because we can fine-tune the query and use it like a laser beam to find exactly the data we are interested in, and only that.
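The idea of applying lemmatization per term can be sketched roughly as follows. This is a toy illustration and not how HebMorph or the Buzzilla engine is implemented: the `LEMMAS` table, the `matches` and `search` helpers, and the English sample documents are all hypothetical stand-ins for a real dictionary-based analyzer and index.

```python
# Toy lemma table (hypothetical). A real analyzer derives lemmas from a dictionary.
LEMMAS = {"book": "book", "books": "book", "booked": "book", "booking": "book"}


def lemma(token):
    """Return the lemma of a token, or the token itself if it is unknown."""
    return LEMMAS.get(token, token)


def matches(doc_tokens, query):
    """query is a list of (term, exact) pairs; every pair must match (AND)."""
    surface = set(doc_tokens)
    lemmas = {lemma(t) for t in doc_tokens}
    for term, exact in query:
        if exact:
            # Exact term: only the literal surface form counts.
            if term not in surface:
                return False
        elif lemma(term) not in lemmas:
            # Lemmatized term: any inflection sharing the lemma counts.
            return False
    return True


# Hypothetical documents, for illustration only.
docs = {
    1: "she booked a flight".split(),
    2: "he read two books".split(),
    3: "a good book".split(),
}


def search(query):
    """Return the sorted ids of all documents matching the query."""
    return sorted(i for i, tokens in docs.items() if matches(tokens, query))


print(search([("book", False)]))  # [1, 2, 3]
print(search([("book", True)]))   # [3]
```

Marking a term as exact trades recall for precision on that term alone, while the rest of the query can still benefit from lemmatization.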

With good UX and some user interaction, we were able to come up with a solution to the ambiguity problems as well as to the Hebrew prefixes problem in search. This solution reflects very clearly what the search engine is doing, and as a result users can refine their queries very easily. There are also custom dictionaries involved, along with various other optimizations that aren't shown in this video.

The search engine shown in the video uses Elasticsearch and a custom Hebrew analyzer plugin based on HebMorph. It took us a while to get this right, but once we did, the response from our users was highly positive. This is really Hebrew search done right. The same techniques can be used for 2-3 word Google-like searches by employing a slightly different UX approach; I will have something to show for this soon.

I'll be blogging about the features shown above from a technical perspective in the near future, especially about selective stemming and multilingual content.
