More flexible Hebrew indexing with HebMorph

English posts, HebMorph, IR

In the past week I've been working on making Hebrew indexing with HebMorph more flexible. It is now possible to perform different types of searches, and also to control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done. HebMorph now contains two new entities: Lucene.Analysis.Hebrew.SimpleAnalyzer and HebMorph.LemmaFilter.

The Hebrew SimpleAnalyzer

Lucene.Analysis.Hebrew.SimpleAnalyzer performs, as its name suggests, simple analysis only. It calls HebMorph.Tokenizer to tokenize the text, and then passes the tokens through a NiqqudFilter (to remove Niqqud characters), a StopFilter (with a list of Hebrew stop-words) and a LowerCaseFilter (to normalize non-Hebrew tokens). The tokenization process also tries to remove certain noise cases, unique to Hebrew, where it's obvious a token is not a real word.

The MorphAnalyzer has a similar process in place, but it also makes use of the metadata returned for each Token. In addition, it uses HebMorph.Lemmatizer to compile a list of possible lemmas, and indexes them along with the original term, which is marked as such. By default, the original term is stored only when more than one lemma is returned for a term.

To enable dual searches, the morphological analyzer needs to store the original token in all cases (since SimpleAnalyzer performs no lemmatization), and to mark it accordingly. SimpleAnalyzer, in turn, needs to be aware of what is an original term and what is not. To achieve that, SimpleAnalyzer now uses AddSuffixFilter to "stick" a $ sign to the end of each analyzed term. MorphAnalyzer, in turn, is asked to append this character to all original terms, and not only in cases where there is more than one possible lemma.

To use this analyzer duality, fields need to be indexed with a MorphAnalyzer that has MorphAnalyzer.alwaysSaveMarkedOriginal = true (it is false by default). When using the same analyzer for search, it is recommended to set it back to false.

To use Hebrew.SimpleAnalyzer to search these fields, register the "$" suffix for all the Hebrew token types as follows:

[code lang="csharp"]
// Append "$" to every Hebrew, acronym and construct token, so that query
// terms match the marked original terms written by MorphAnalyzer.
SimpleAnalyzer an = new SimpleAnalyzer();
an.RegisterSuffix(HebrewTokenizer.TokenTypeSignature(HebrewTokenizer.TOKEN_TYPES.Hebrew), "$");
an.RegisterSuffix(HebrewTokenizer.TokenTypeSignature(HebrewTokenizer.TOKEN_TYPES.Acronym), "$");
an.RegisterSuffix(HebrewTokenizer.TokenTypeSignature(HebrewTokenizer.TOKEN_TYPES.Construct), "$");
[/code]
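To make the marking scheme concrete, here is a minimal, self-contained sketch of the idea. It does not use HebMorph or Lucene at all: the class OriginalTermMarking and its methods are hypothetical names invented for illustration, and the lemma list is supplied by hand rather than by a real lemmatizer. It only shows why an exact query term carrying the "$" suffix can match a field indexed morphologically.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of the "$" marking scheme; this is NOT HebMorph's code.
public static class OriginalTermMarking
{
    public const string OriginalMarker = "$";

    // Terms a morphological analyzer would write to the index for one token.
    // Lemmas are stored unmarked; the original term gets the marker suffix.
    public static List<string> IndexTerms(string original, IList<string> lemmas,
                                          bool alwaysSaveMarkedOriginal)
    {
        var terms = new List<string>(lemmas);
        // Default behaviour described in the post: keep the original only when
        // the token is ambiguous (more than one lemma). With the flag set, the
        // marked original is always stored.
        if (alwaysSaveMarkedOriginal || lemmas.Count > 1)
            terms.Add(original + OriginalMarker);
        return terms;
    }

    // Term a simple (exact) analyzer would produce at search time.
    public static string ExactQueryTerm(string token)
    {
        return token + OriginalMarker;
    }
}
```

With the flag on, IndexTerms("הספר", new List<string> { "ספר" }, true) contains the term produced by ExactQueryTerm("הספר"), so the exact search finds it. With the flag off and a single lemma, the marked original is never written and the exact search has nothing to match.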

Flexible lemma filtering

Until today, no filtering was applied to the collection of lemmas returned from the lemmatizer. Since the toleration mechanism often returns some very wrong lemmas, we needed a way of moderating the lemmas accepted by the morphological analyzer. This is where HebMorph.LemmaFilter comes in. It is a class that accepts a collection of HebMorph.Token objects and returns only the tokens that pass the checks done by its member function IsValidToken. A LemmaFilter is "pluggable" into MorphAnalyzer, and can also be used to create more focused searches (or to expand them further if required). Since the LemmaFilter object is aware of all token properties returned from the lemmatizer, it is quite a powerful mechanism, allowing the consumer
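As a rough illustration of the filtering pattern described here, the sketch below keeps only the lemma candidates that pass an IsValidToken check. Everything in it except the name IsValidToken is an assumption made for the example: LemmaCandidate is a hypothetical stand-in for HebMorph.Token, the Score property and the minimum-score rule are invented, and the real LemmaFilter may have a different shape.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a lemmatizer result. The real HebMorph.Token
// may expose different properties; Score is assumed for this example.
public class LemmaCandidate
{
    public string Lemma;
    public float Score; // assumed: higher means a more trustworthy lemma

    public LemmaCandidate(string lemma, float score)
    {
        Lemma = lemma;
        Score = score;
    }
}

// Sketch of a pluggable filter: subclasses override IsValidToken to change the rule.
public class MinScoreLemmaFilter
{
    private readonly float minScore;

    public MinScoreLemmaFilter(float minScore)
    {
        this.minScore = minScore;
    }

    // The per-token check; here a simple threshold on the assumed score.
    public virtual bool IsValidToken(LemmaCandidate t)
    {
        return t.Score >= minScore;
    }

    // Returns only the candidates that pass IsValidToken, preserving order.
    public List<LemmaCandidate> Filter(IEnumerable<LemmaCandidate> candidates)
    {
        var kept = new List<LemmaCandidate>();
        foreach (var c in candidates)
        {
            if (IsValidToken(c))
                kept.Add(c);
        }
        return kept;
    }
}
```

A stricter filter at search time than at index time would narrow a query to the most likely lemmas, while a looser one would widen it.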

Practical use

Using these new features, preliminary results for searches run on a corpus of about 15,000 Hebrew documents look pretty good. Indexing is done with MorphAnalyzer to create one field only, and then searches are performed in one of two flavors: exact (using Hebrew.SimpleAnalyzer) to find names or exact phrases like שיר השירים, and morphological (using MorphAnalyzer) to find topics or non-exact words.

Comments

  • ofer

    very impressive work, do you have an idea when it will be stable enough to port to java?

  • Adam

    Looks nice. Any simple way you know of just to do searching that ignores nikkud?

    Thanks!

  • Manoj

    Looks really good. I am trying to evaluate using HebMorph to do Hebrew search using Solr in our application. I am able to build the JAR files but am not able to make Solr use lucene.analysis.hebrew.MorphAnalyzer. I get the run-time exception shown below. Any idea what is going wrong? I am running Solr 1.4.1 (Lucene 2.9.3). Thanks

    Nov 22, 2011 5:38:51 PM org.apache.solr.common.SolrException log
    SEVERE: java.lang.ClassCastException: org.apache.lucene.analysis.hebrew.MorphAnalyzer cannot be cast to org.apache.lucene.analysis.Analyzer
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:759)
        at org.apache.solr.schema.IndexSchema.access$100(IndexSchema.java:58)
        at org.apache.solr.schema.IndexSchema$1.create(IndexSchema.java:429)

  • Nicolas

    Nice work!!!

    I use hebmorph to index some data with the alwaysSaveMarkedOriginal set to true.

    But when searching for an exact phrase with the Hebrew.SimpleAnalyzer, it sometimes removes some words.

    i.e. searching for "הספר של" results in a TermQuery like this "הספר$" instead of "של $ הספר$".

    Any ideas how I can force the QueryParser to not drop some words??

    Thanks Nicolas
