Hebrew search with ElasticSearch using HebMorph
Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications.
This post is a short step-by-step guide on how to use HebMorph in an ElasticSearch installation. There are quite a few configuration options and things to consider when enabling Hebrew search, most of them performance vs. relevance trade-offs, but I'll talk about those in a separate post.
0. What exactly is HebMorph
HebMorph is a project a bit wider than just providing a Hebrew search plugin for ElasticSearch, but for the purpose of this post let us treat it in that narrow aspect.
HebMorph has 3 main parts - the hspell dictionary files, the hebmorph-core package which is a wrapper around the dictionary files with important bits that allow for locating words even if they weren't written exactly as they appear in the dictionary, and the hebmorph-lucene package which contains various tools for processing streams of text into Lucene tokens - the searchable parts.
To enable Hebrew search from ElasticSearch we are going to need to use the Hebrew analyzer class HebMorph provides to analyze incoming Hebrew texts. That is done by providing ElasticSearch with the HebMorph packages and then telling it to use the Hebrew analyzer on text fields as needed.
1. Get HebMorph and hspell
At the moment you will have to compile HebMorph from sources yourself using Maven. In the future we might upload it to a centralized repository, but since we are still actively working on a lot of things there, it is a bit too early for that.
Probably the easiest way to get HebMorph is to do git clone from the main repository. The repository is located at https://github.com/synhershko/HebMorph and includes the latest hspell files already under /hspell-data-files. If you are new to git, GitHub offers great tutorials for getting started with it, and also lets you download the entire source tree as a zip or a tarball.
Once you have the sources, run mvn package or mvn install to create 2 jars - hebmorph-core and hebmorph-lucene. Those 2 packages are required before moving on to the next step.
2. Create an ElasticSearch plugin
In this step we will create a new plugin which we will use in the next step to create the Hebrew analyzers in. If you already have a plugin you wish to use, skip to the next step.
ElasticSearch plugins are compiled Java packages that you simply drop into the plugins folder of your ElasticSearch installation; they are detected automatically once the ElasticSearch instance is initialized. If you are new to this, you might want to read up a bit on that in the official ElasticSearch documentation. Here is a great guide to start with: http://jfarrell.github.io/
The gist of this is having a Java project with an es-plugin.properties file embedded as a resource and pointing to a class that tells ElasticSearch what classes to load as plugins, and their plugin type. In the next section we will use this to add our own Analyzer implementation which makes use of HebMorph's capabilities.
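As a rough sketch of what that resource file might look like for the ElasticSearch 0.90-era plugin mechanism this post targets: it holds a single plugin key naming your plugin class. The package name com.example.hebrew is a made-up placeholder, and MyPlugin refers to the plugin class shown later in this post.

```properties
# src/main/resources/es-plugin.properties
# Fully qualified name of the plugin class (the package here is hypothetical)
plugin=com.example.hebrew.MyPlugin
```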
3. Creating a Hebrew Analyzer
HebMorph already comes with MorphAnalyzer - an Analyzer implementation which takes care of Hebrew-aware tokenization, lemmatization and whatnot. Because it is highly configurable, I personally prefer re-implementing it in the ElasticSearch plugin so it is easier to change the configuration in code. In case you wondered, I'm not planning on supporting external configuration for this, as it is too subtle and you should really know what you are doing there.
Don't forget to add dependencies to hebmorph-core and hebmorph-lucene to your project.
My common Analyzer setup for Hebrew search looks like this:
import java.io.File;
import java.io.IOException;
import java.io.Reader;
// Imports for the Lucene and HebMorph classes are omitted; their packages depend on the versions you build against.

public abstract class HebrewAnalyzer extends ReusableAnalyzerBase {

    protected enum AnalyzerType {
        INDEXING, QUERY, EXACT
    }

    private static final DictRadix<Integer> prefixesTree = LingInfo.buildPrefixTree(false);
    private static DictRadix<MorphData> dictRadix;

    private final StreamLemmatizer lemmatizer;
    private final LemmaFilterBase lemmaFilter;

    protected final Version matchVersion;
    protected final AnalyzerType analyzerType;
    protected final char originalTermSuffix = '$';

    static {
        try {
            // resourcesPath is not defined in this snippet; it must point at the folder holding hspell-data-files
            dictRadix = Loader.loadDictionaryFromHSpellData(new File(resourcesPath + "hspell-data-files"), true);
        } catch (IOException e) {
            // TODO log; note that dictRadix stays null if loading fails
        }
    }

    protected HebrewAnalyzer(final AnalyzerType analyzerType) throws IOException {
        // The original snippet assigned matchVersion to itself; LUCENE_42 is assumed here
        // to match the Lucene 4.2.1 version this post targets
        this.matchVersion = Version.LUCENE_42;
        this.analyzerType = analyzerType;
        lemmatizer = new StreamLemmatizer(null, dictRadix, prefixesTree, null);
        lemmaFilter = new BasicLemmaFilter();
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        // on query - if marked as keyword don't keep origin, else only lemmatized (don't suffix)
        // if word terminates with $ will output word$, else will output all lemmas or word$ if OOV
        if (analyzerType == AnalyzerType.QUERY) {
            final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter);
            src.setAlwaysSaveMarkedOriginal(true);
            src.setSuffixForExactMatch(originalTermSuffix);
            TokenStream tok = new SuffixKeywordFilter(src, '$');
            return new TokenStreamComponents(src, tok);
        }

        if (analyzerType == AnalyzerType.EXACT) {
            // on exact - we don't care about suffixes at all, we always output original word with suffix only
            final HebrewTokenizer src = new HebrewTokenizer(reader, prefixesTree, null);
            TokenStream tok = new NiqqudFilter(src);
            tok = new LowerCaseFilter(matchVersion, tok);
            tok = new AlwaysAddSuffixFilter(tok, '$', false);
            return new TokenStreamComponents(src, tok);
        }

        // on indexing we should always keep both the stem and marked original word
        // will ignore $ and will always output all lemmas + origin word$
        // (this is the analyzerType == AnalyzerType.INDEXING case)
        final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter);
        src.setAlwaysSaveMarkedOriginal(true);
        TokenStream tok = new SuffixKeywordFilter(src, '$');
        return new TokenStreamComponents(src, tok);
    }

    public static class HebrewIndexingAnalyzer extends HebrewAnalyzer {
        public HebrewIndexingAnalyzer() throws IOException {
            super(AnalyzerType.INDEXING);
        }
    }

    public static class HebrewQueryAnalyzer extends HebrewAnalyzer {
        public HebrewQueryAnalyzer() throws IOException {
            super(AnalyzerType.QUERY);
        }
    }

    public static class HebrewExactAnalyzer extends HebrewAnalyzer {
        public HebrewExactAnalyzer() throws IOException {
            super(AnalyzerType.EXACT);
        }
    }
}
You may notice how I created 3 separate analyzers - one for indexing, one for querying and the last for exact querying. I'll be talking more about this in future posts, but the idea is to provide flexibility on querying while still allowing for correct indexing.
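To make the division of labour between the three analyzers a bit more concrete, here is a minimal plain-Java sketch of the token sets each one would emit, following the comments in the code above. It does not use Lucene or HebMorph at all: the class name, the method names and the tiny transliterated dictionary are all made up for illustration, and the real lemmatizer is of course far more involved.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration of the "marked original" idea; no Lucene or HebMorph involved.
final class MarkedOriginalSketch {
    static final String SUFFIX = "$";

    // Hypothetical toy dictionary (transliterated) standing in for the hspell data.
    static final Map<String, List<String>> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("hayeladim", Arrays.asList("yeled")); // "the children"
        LEMMAS.put("yeladim", Arrays.asList("yeled"));   // "children"
    }

    static List<String> lemmasOf(String word) {
        List<String> found = LEMMAS.get(word);
        return found == null ? Collections.<String>emptyList() : found;
    }

    // Indexing: all lemmas plus the original word marked with the suffix.
    static Set<String> indexTokens(String word) {
        Set<String> tokens = new LinkedHashSet<>(lemmasOf(word));
        tokens.add(word + SUFFIX);
        return tokens;
    }

    // Query: a word already ending with the suffix is kept as-is (exact);
    // otherwise emit the lemmas, or word$ when the word is out of vocabulary.
    static Set<String> queryTokens(String word) {
        if (word.endsWith(SUFFIX)) {
            return Collections.singleton(word);
        }
        List<String> lemmas = lemmasOf(word);
        if (lemmas.isEmpty()) {
            return Collections.singleton(word + SUFFIX);
        }
        return new LinkedHashSet<>(lemmas);
    }

    // Exact: always the original word with the suffix, nothing else.
    static Set<String> exactTokens(String word) {
        return Collections.singleton(word + SUFFIX);
    }

    // A document matches when it shares at least one token with the query.
    static boolean matches(Set<String> indexed, Set<String> query) {
        return !Collections.disjoint(indexed, query);
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTokens("hayeladim");
        System.out.println(indexed);
        System.out.println(matches(indexed, queryTokens("yeladim")));
        System.out.println(matches(indexed, exactTokens("yeladim")));
        System.out.println(matches(indexed, exactTokens("hayeladim")));
    }
}
```

In this toy setup a morphological query for a different surface form still finds the document through the shared lemma, while an exact query only matches the very same surface form, and both work against a single indexed field.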
Configuring the analyzers to be picked up from ElasticSearch is rather easy now. First, you need to wrap each analyzer in a "provider", like so:
public class HebrewQueryAnalyzerProvider extends AbstractIndexAnalyzerProvider<HebrewAnalyzer.HebrewQueryAnalyzer> {
    private final HebrewAnalyzer.HebrewQueryAnalyzer hebrewAnalyzer;

    @Inject
    public HebrewQueryAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException {
        super(index, indexSettings, name, settings);
        hebrewAnalyzer = new HebrewAnalyzer.HebrewQueryAnalyzer();
    }

    @Override
    public HebrewAnalyzer.HebrewQueryAnalyzer get() {
        return hebrewAnalyzer;
    }
}
After you've created such providers for all types of analyzers, create an AnalysisBinderProcessor like this (or update your existing one with definitions for the Hebrew analyzers):
public class MyAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {

    private static final HashMap<String, Class<? extends AnalyzerProvider>> languageAnalyzers = new HashMap<>();

    static {
        languageAnalyzers.put("hebrew", HebrewIndexingAnalyzerProvider.class);
        languageAnalyzers.put("hebrew_query", HebrewQueryAnalyzerProvider.class);
        languageAnalyzers.put("hebrew_exact", HebrewExactAnalyzerProvider.class);
    }

    public static boolean analyzerExists(final String analyzerName) {
        return languageAnalyzers.containsKey(analyzerName);
    }

    @Override
    public void processAnalyzers(final AnalyzersBindings analyzersBindings) {
        for (Map.Entry<String, Class<? extends AnalyzerProvider>> entry : languageAnalyzers.entrySet()) {
            analyzersBindings.processAnalyzer(entry.getKey(), entry.getValue());
        }
    }
}
Don't forget to update your Plugin class to catch the AnalysisBinderProcessor - it should look something like this (plus any other stuff you want to add there):
public class MyPlugin extends AbstractPlugin {

    @Override
    public String name() {
        return "my-plugin";
    }

    @Override
    public String description() {
        return "Implements custom actions required by me";
    }

    @Override
    public void processModule(Module module) {
        if (module instanceof AnalysisModule) {
            ((AnalysisModule) module).addProcessor(new MyAnalysisBinderProcessor());
        }
    }
}
4. Using the Hebrew analyzers
Compile the ElasticSearch plugin and drop it along with its dependencies in a folder under the /plugins folder of ElasticSearch. You now have 3 new types of analyzers at your disposal: "hebrew", "hebrew_query" and "hebrew_exact".
For indexing, you want to use the "hebrew" analyzer. In your mapping, you can define a certain field or an entire set of fields to use that specific analyzer by setting the analyzer for that field. You can also leave the analyzer configuration blank, and specify the analyzer to use for fields with no analyzer set by using the _analyzer field in the index request. Both options are described in the ElasticSearch mapping documentation.
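As an illustration only, a mapping along these lines should tie a field to the new analyzers on an ElasticSearch 0.90-era cluster. The type name article and the field name body are made up for the example, and I'm assuming the index_analyzer and search_analyzer settings of string fields as they existed in that generation of ElasticSearch:

```json
{
  "mappings": {
    "article": {
      "properties": {
        "body": {
          "type": "string",
          "index_analyzer": "hebrew",
          "search_analyzer": "hebrew_query"
        }
      }
    }
  }
}
```

With this in place, documents are analyzed with "hebrew" at index time, and queries against that field default to "hebrew_query" unless the query itself names another analyzer.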
The "hebrew" analyzer will expand each term to all recognized lemmas; in case the word wasn't recognized it will try to tolerate spelling errors or missing Yud/Vav - most of the time it will be successful (with some rate of false positives, which the lemma-filters should remove to some degree). Some words will still remain unrecognized and thus will be indexed as-is.
When querying using a QueryString query you can specify what analyzer to use - use the "hebrew_query" or "hebrew_exact" analyzer. The former will perform lemma expansion similar to the indexing analyzer, and the latter will avoid that and allow you to perform exact matches (useful when searching for names or exact phrases).
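For example, a query_string request body that asks for exact matching could look roughly like the following. The field name body and the search term are placeholders; swap "hebrew_exact" for "hebrew_query" to get lemma expansion instead:

```json
{
  "query": {
    "query_string": {
      "default_field": "body",
      "query": "ירושלים",
      "analyzer": "hebrew_exact"
    }
  }
}
```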
I pretty much ignored a lot of the complexity involved in fine tuning searches for Hebrew, and many very cool things HebMorph allows you to do with Hebrew search for the sake of focus. I will revisit them in a later blog post.
5. Administration
The hspell dictionary files are looked up by a physical location on disk - you will need to provide the path where they are saved. Since dictionaries get updated, it is sometimes easier to update them that way in a distributed environment like the one I'm working with. It may be desirable to have them compiled within the same jar file as the code itself - I'll be happy to accept a pull request to do that.
The code above works with ElasticSearch 0.90 GA and Lucene 4.2.1. I also had it running on earlier versions of both technologies, but may have had to make a few minor changes. I assume the samples will break on future versions and I probably won't have much time to go back and keep them up to date, but bear in mind that most of the time the changes are minor and easy to understand and make by yourself.
Both HebMorph and the hspell dictionary are released under the AGPL3. For any questions on licensing, feel free to contact me.
Hacking with RavenDB’s multi-maps
A couple of months ago I blogged about Orev - OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox that make it possible to measure and compare the relevance of different full-text search methods.
In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.
Orev was built using RavenDB as its back-store, and in this post I'm going to show a nice approach we used to facilitate the judging process.
The Model
To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.
A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for many reasons: they are different transactional units (if I update a typo in a document, the entire Corpus doesn't really change), and we want to retrieve one single document at a time when judging. Also, containing all documents within a parent Corpus document will bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.
And this is how they look:
public class Corpus
{
    public string Id { get; set; }

    [Required]
    public string Name { get; set; }

    [Required]
    public string Description { get; set; }

    [Required]
    [StringLength(5, MinimumLength = 5)]
    public string Language { get; set; } // a language identifier string, en-US for example
}

public class CorpusDocument
{
    public string Id { get; set; }
    public string CorpusId { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
}

public class Topic
{
    public string Id { get; set; }

    [Required]
    public string Title { get; set; }

    [Required]
    [DataType(DataType.MultilineText)]
    public string Description { get; set; }

    [Required]
    [DataType(DataType.MultilineText)]
    public string Narrator { get; set; }

    [Required]
    [StringLength(5, MinimumLength = 5)]
    public string Language { get; set; } // a language identifier string, en-US for example

    /// <summary>
    /// Id of user submitting this topic
    /// </summary>
    public string UserId { get; set; }
}

public class Judgment
{
    public enum Verdict
    {
        Relevant,
        NotRelevant,
        Skip,
    };

    [Required]
    public string CorpusId { get; set; }

    [Required]
    public string DocumentId { get; set; }

    [Required]
    public string TopicId { get; set; }

    [Required]
    public string UserId { get; set; }

    [Required]
    public Verdict UserJudgement { get; set; }
}
The Problem
We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.
Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.
The Solution
At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?
And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look at all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...
So this is where RavenDB's multi-maps come in. I select all Judgments with their Topic ID within an array, and all CorpusDocuments each with an array holding only an empty string. This will result in one big set of rows, with each row containing the CorpusDocument ID (which is a document ID + the corpus ID) and one Topic ID there exists a Judgment for. The reason I'm selecting Topics as an array at this stage is to comply with the format we will produce results in during the Reduce step; RavenDB requires all Map and Reduce functions to have the same type of output.
It is important to note ALL corpus documents will be listed, but there may be corpus documents with no topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.
The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows based on the CorpusDocument ID (which includes the Corpus ID), we can have a smaller set of rows, where a CorpusDocument is represented only once, and along with it all the Topic IDs there are Judgments for. So, if we previously had a lot of rows, each row with one CorpusDocument identifier and one Topic identifier, we now consolidated all the data we have for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.
Hence, we write this index:
public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
{
    public class ReduceResult
    {
        public string DocumentId { get; set; }
        public string CorpusId { get; set; }
        public string[] Topics { get; set; }
    }

    public CorpusDocuments_ByNextUnrated()
    {
        // Every corpus document is listed, with an empty-string placeholder topic
        AddMap<CorpusDocument>(docs => from corpusDoc in docs
                                       select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] { string.Empty } });

        // Every judgment contributes the topic it was made for
        AddMap<Judgment>(judgments => from j in judgments
                                      select new { DocumentId = j.DocumentId, j.CorpusId, Topics = new[] { j.TopicId } });

        // One row per corpus document, with all topics it has been judged against
        Reduce = results => from result in results
                            group result by new { result.DocumentId, result.CorpusId }
                            into g
                            select new
                            {
                                DocumentId = g.Key.DocumentId,
                                CorpusId = g.Key.CorpusId,
                                Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
                            };

        TransformResults = (db, results) => from result in results
                                            let doc = db.Load<CorpusDocument>(result.DocumentId)
                                            select doc;
    }
}
Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of topics with judgments for each CorpusDocument, where all CorpusDocuments exist, even if they were never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:
var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
    .Where("Topics:*") // match all docs
    .AndAlso()
    .WhereEquals("CorpusId", corpusId)
    .AndAlso()
    .Not
    .WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
    .RandomOrdering()
    .FirstOrDefault();
This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1
This query will first match all index documents from the given corpus, and then will remove all CorpusDocuments which have the given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made for it; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.
And just to spice things up a bit, I threw in RandomOrdering().
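For readers who want to see the idea without a RavenDB instance at hand, here is a minimal in-memory sketch in plain Java of what the two maps, the reduce and the match-all-except query boil down to. None of this is RavenDB API: the class name, the method names and the use of string arrays as rows are made up for illustration, and the random ordering is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory imitation of the multi-map/reduce index; not RavenDB code.
final class NextUnratedSketch {

    // One reduced row: a corpus document and every topic it has a judgment for.
    static final class Row {
        final String documentId;
        final String corpusId;
        final Set<String> topics = new LinkedHashSet<>();

        Row(String documentId, String corpusId) {
            this.documentId = documentId;
            this.corpusId = corpusId;
        }
    }

    // Adds one mapped row to its (corpusId, documentId) group, as the reduce would.
    private static void emit(Map<String, Row> grouped, String documentId, String corpusId, String topicId) {
        String key = corpusId + "|" + documentId;
        Row row = grouped.get(key);
        if (row == null) {
            row = new Row(documentId, corpusId);
            grouped.put(key, row);
        }
        row.topics.add(topicId);
    }

    // corpusDocs rows are {documentId, corpusId}; judgment rows are {documentId, corpusId, topicId}.
    static Collection<Row> buildIndex(List<String[]> corpusDocs, List<String[]> judgments) {
        Map<String, Row> grouped = new LinkedHashMap<>();
        for (String[] doc : corpusDocs) {
            emit(grouped, doc[0], doc[1], ""); // first map: every document, empty-string topic
        }
        for (String[] judgment : judgments) {
            emit(grouped, judgment[0], judgment[1], judgment[2]); // second map: judged topics
        }
        return grouped.values();
    }

    // All documents of the corpus, except those already judged for the topic.
    static List<String> unjudged(Collection<Row> index, String corpusId, String topicId) {
        List<String> result = new ArrayList<>();
        for (Row row : index) {
            if (row.corpusId.equals(corpusId) && !row.topics.contains(topicId)) {
                result.add(row.documentId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[] {"corpusdocuments/1", "corpus/1"},
                new String[] {"corpusdocuments/2", "corpus/1"});
        List<String[]> judgments = Arrays.asList(
                new String[] {"corpusdocuments/1", "corpus/1", "topics/1"},
                new String[] {"corpusdocuments/1", "corpus/1", "topics/2"});
        System.out.println(unjudged(buildIndex(docs, judgments), "corpus/1", "topics/1"));
    }
}
```

The empty-string topic plays the same role here as in the index: it guarantees that a document with no judgments at all still gets a row, so it can be returned by the query.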
Hebrew search at the National Library
"All of the National Library's collections, now online", the headlines cried. As a lover of texts, I went to see what it was all about.
The library's site (http://web.nli.org.il) gives access to the catalog and to various archives, with a free-text search box at the top of the site. Naturally, that was the first thing I tried...
Well, it seems the Hebrew search problem was indeed known and taken into account when the site was built. It looks like some attention was given to some kind of morphological handling, but unfortunately the results are far from good, or even correct.
A few representative examples, each with its conclusion (in brief):
- Searching for "רבין" (Rabin) returns completely irrelevant results in the first 6 hits (with the word "רביניו" highlighted). An audio recording by Ozer Rabin appears seventh, the first among the results for "רבין". This is especially bad recall. The reason is that exact forms and forms merely suspected to be similar are given equal weight, and it is worth noting that this is a word with very few possible inflections.
- The prefix letters (the "מש"ה וכל"ב" letters) are not handled properly at all - a search for "הלב" does not return results containing the word "לב", and only inflections of "לב" with the prefix ה are retrieved. This is not the right way to do it - we would want to rank exact retrievals higher, but not lose relevant retrievals that were originally written without the prefix letters.
- Gershayim (the double-quote mark used in Hebrew acronyms). Not supported. At all. Searching for צה"ל, רמב"ם, רמב"ן yields no results (but צהל, רמבם do).
- Full / defective spelling (ktiv male / ktiv haser) - not supported at all. Searches for אמא / אימא, חנוכיה / חנוכייה, ספריה / ספרייה and others return completely different results.
All of the examples above lead me to believe this is query expansion of some sort, and in any case it is clear that this is a very lightweight search engine for the national book collection. The search is not exhaustive, and has very low precision & recall. In several talks I gave on the subject I have already shown examples of this on sites like Ynet, the Hebrew Wikipedia and Tapuz, but from the National Library of all places I expected more...
The HebMorph project, which you can read a lot about on this site as well, was created exactly for this purpose, and it is open source (with an option for commercial use). A short session with the live demo is enough to see that the engine already handles the points mentioned above as well...
Practical Hebrew search – Open2011 presentation
Attached with this post is the presentation I gave today at Open2011 in Tel-Aviv.
The sample app can be found here: http://hebmorph.code972.com/. It is also going to be HebMorph's home in a few weeks, once I'm done generating all the necessary content.
As promised, I will be posting more details on some interesting findings on Hebrew search, and comparisons with Google search. I want those posts to be a bit more comprehensive, so they will be up in a few weeks' time.
HebMorph at SIGTRS 07/10
Today I gave a talk at SIGTRS on Hebrew search and HebMorph. Attached with this post is the slideshow from the presentation. More info on HebMorph is accessible through the project's page.
A PDF with the presentation summary in Hebrew is available as well (6 pages): HebMorph SIGTRS presentation summary. It describes what exactly HebMorph is, what problems it tries to solve, and how.
More flexible Hebrew indexing with HebMorph
In the past week I've been working on making Hebrew indexing with HebMorph more flexible. Now it is possible to perform different types of searches, and also control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done.
Open-source Hebrew information retrieval (HebMorph, part 3)
Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, I have pointed out that they do not necessarily provide the best results. Either way, there is no freely available solution that allows indexing Hebrew even at the very basic level.
HebMorph was started with this in mind. It is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals. During the work on this project, we will try and come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is providing various IR libraries with the best Hebrew IR capabilities possible.
Finding Hebrew lemmas (HebMorph, part 2)
As shown in the previous post, building a Hebrew-aware search engine is not trivial. Several attempts (mainly commercial) were made to deal with that. In this post I'm going to try and draw a complete picture of what they did, and show other routes that may exist. In the next post I'll discuss HebMorph itself.
Challenges with indexing Hebrew texts (HebMorph, part 1)
Unfortunately, there is no magic trick for correctly indexing and searching Hebrew texts. Semitic languages like Hebrew, Arabic, and Aramaic are the hardest to morphologically analyze and disambiguate, and as a result creating a perfect IR solution for them, if at all possible, requires a lot of research and a very long process of trial and error. Some claim Hebrew is the most complex language of all from an NLP perspective. I don't know other Semitic languages well enough to comment on this, but I do know Hebrew to be complicated enough...
Since someone had to do this lengthy and tiresome work someday, I decided to go forward and do the heavy lifting myself instead of waiting for someone else to pick it up. That, and the fact I needed such a solution for another product I'm working on. This effort - HebMorph - is all about making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. As of this writing, it is still in a design phase, and is available from the github repository.
In a series of posts, I'm going to investigate this subject, and hopefully draw a complete picture. I'll start by explaining Hebrew morphology and how it affects common IR methods. From there, I'll present several possible ways to attack the problem, and finally discuss what exactly HebMorph does and what its goals and roadmap are.