Code972 Coding from the back of a camel

HebMorph

Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, they don't necessarily provide the best results in terms of relevance. Either way, there is no freely available solution that can index Hebrew even at the most basic level.

HebMorph was started with this in mind. It is a free, open-source effort to make Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals. During the work on this project, we will try to come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is to provide various IR libraries with the best Hebrew IR capabilities possible.

Apache Lucene has been selected as our planning and testing framework. This is thanks to its advanced capabilities, flexibility, and the author's familiarity with it. During these initial steps, .NET code is being written and used with Lucene.Net (a .NET port of Java Lucene). Once the project stabilizes enough, ports to other languages will follow.

More detailed information on why this project is important can be found in a series of three blog posts: Challenges with indexing Hebrew texts (HebMorph, part 1), Finding Hebrew lemmas (HebMorph, part 2) and Open-source Hebrew information retrieval (HebMorph, part 3). The project's roadmap is in the last part.

The new HebMorph home, which is still being populated with content: http://hebmorph.code972.com

Resources

Live demo: http://hebmorph.code972.com

SourceForge home: https://sourceforge.net/projects/hebmorph/

Code repository: http://github.com/synhershko/HebMorph

Think-tank mailing list for discussion and planning: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank

Updates

02/07/2010 - Exact and morphological searches can now be performed, both on a single field. A new flexible LemmaFilter mechanism has been introduced. Preliminary results look good.

13/07/2010 - Testing of hspell's language coverage provides promising results. Most unrecognized words are invalid; the rest should either help perfect HebMorph's toleration mechanism or go into hspell's dictionary.

18/07/2010 - The Wikipedia browser and searcher application BzReader is now bundled with HebMorph, to provide better Hebrew searches - http://www.code972.com/blog/2010/07/wikipedia-offline-reader-supporting-hebrew/.

22/07/2010 - HebMorph was presented at SIGTRS. Slideshow and Hebrew project description are available.

27/07/2010 - A flowchart shedding some light on the inner workings of HebMorph is available through the project wiki on GitHub, at http://wiki.github.com/synhershko/HebMorph/hebmorph-flowchart
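
To illustrate what the 02/07/2010 update means by running exact and morphological searches on a single field, here is a minimal Python sketch. It is not HebMorph's code: the toy lemma table, the "$" marker appended to exact surface forms, and all function names are assumptions made up for this example. The idea shown is simply that each token contributes both its exact form and its lemmas to the same field, so either kind of query can be answered from one index.

```python
from collections import defaultdict

# Toy lemma table (hypothetical data, not taken from hspell).
LEMMAS = {
    "ילדים": ["ילד"],  # "children" -> "child"
    "הילד": ["ילד"],   # "the child" -> "child"
    "רצים": ["רץ"],    # "run" (plural) -> "run" (singular)
}

# Assumed convention for this sketch: exact surface forms carry a marker.
EXACT_MARKER = "$"


def analyze(token):
    """Return the terms indexed for one token: its exact form plus its lemmas.

    Tokens missing from the lemma table fall back to themselves as a lemma.
    """
    terms = [token + EXACT_MARKER]
    terms.extend(LEMMAS.get(token, [token]))
    return terms


def build_index(docs):
    """Build an inverted index (term -> set of doc ids) over a single field."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in analyze(token):
                index[term].add(doc_id)
    return index


def search_exact(index, word):
    """Match only documents containing this exact surface form."""
    return set(index.get(word + EXACT_MARKER, set()))


def search_morph(index, word):
    """Match documents containing any form that shares a lemma with the word."""
    hits = set()
    for lemma in LEMMAS.get(word, [word]):
        hits |= index.get(lemma, set())
    return hits


docs = {1: "הילד רץ", 2: "ילדים רצים", 3: "כלב רץ"}
index = build_index(docs)
```

With this toy data, an exact search for "ילדים" matches only document 2, while a morphological search for the same word also matches document 1, because both surface forms share the lemma "ילד".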

Other uses

The tools provided in HebMorph can be used for a large variety of other things, including:

  1. A .NET Hebrew spellchecker
  2. OCR automatic corrections and probabilistic recognition, using the dictionary and optionally the tolerator functions.
  3. Basis for POS tagging software.
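
As a rough illustration of the spellchecker and OCR-correction uses listed above, the following Python sketch suggests dictionary words that are one edit away from an unrecognized token. It does not use HebMorph's dictionary or tolerator functions: the tiny word set, the edit-distance-one strategy, and the function names are all assumptions chosen for the example.

```python
# The 22 Hebrew letters followed by the 5 final forms.
HEBREW_LETTERS = "אבגדהוזחטיכלמנסעפצקרשתךםןףץ"

# Toy dictionary (hypothetical; a real one would be built from hspell data).
DICTIONARY = {"ספר", "ספרים", "שלום", "ילד"}


def edits1(word):
    """All strings one deletion, substitution, or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in HEBREW_LETTERS}
    inserts = {a + c + b for a, b in splits for c in HEBREW_LETTERS}
    return (deletes | replaces | inserts) - {word}


def suggest(word, dictionary=DICTIONARY):
    """Return the word itself if it is known, else known words one edit away."""
    if word in dictionary:
        return [word]
    return sorted(edits1(word) & dictionary)
```

For example, suggest("ספרימ"), where a regular mem was recognized in place of the final mem, returns ["ספרים"]. An OCR pipeline could rank such candidates by how visually similar the swapped letters are, which this sketch does not attempt.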

License

HebMorph is copyright (C) 2010-2011, Itamar Syn-Hershko.

Some parts of HebMorph are powered by hspell, copyright (C) 2000-2011, Nadav Har'El and Dan Kenigsberg (http://hspell.ivrix.org.il/).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License v3 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the AGPLv3. There is no warranty of any kind for the contents of this distribution.

The hspell dictionary files distributed with HebMorph are licensed to be used ONLY for search by HebMorph. To get an official hspell distribution under the GPLv2 license, visit the hspell site.

If you are interested in using this product commercially or without releasing your source-code as AGPLv3, please contact the author.

Comments (20) Trackbacks (6)
  1. I’m running a search engine that uses the package, and it does work pretty smoothly with Lucene, but how can we integrate non-Hebrew words? (e.g. סינמה סיטי)
    Great job so far!

    • Thanks.

      You are going to need to inject all words not in the hspell dictionary directly into the Radix we use internally, along with the correct morphological properties for each. However, this is not a trivial task, and it also requires versioning the index. It is an important item on my TODO list for this project.

  2. Do you intend to extend the project to entity extraction as well?

    • HebMorph focuses on developing proper Hebrew search. Entity extraction uses language-independent techniques, so in a complete product all I’ll have to do for entity extraction is grab an implementation from somewhere else…

  3. More about entity extraction: can you estimate when it will be ready?

  4. Hi Itamar.
    Is there an example of using the HebMorph with Lucene (wrapping it under an Analyzer)?

    Best Regards
    Elad Ash

  5. Let me rephrase, is there an analyzer under the Lucene.Net…

  6. Hi,

    I just downloaded the new Lucene.Net 2.9.4, and now my project fails to build because Lucene.Net.Analysis is missing the 2.9.2 reference.

    Are you planning to release a fix, or do I need to download the source and compile the solution again?

    • Updating the project references and compiling the sources against those binaries would do. I will not update it myself since not everybody is using the latest Lucene.NET. I will probably move to using nuget though, it will make it easier to upgrade.

      Out of curiosity, how do you use HebMorph? Can you please email me the details?

      • We are using it at cet.
        Talk to Oren Eini; he was here consulting us on RavenDB and Lucene.Net.

        Also, feel free to email me and I will tell you more about how we are using HebMorph with Lucene.Net.

        Anyway, I will download the source and compile it against the new Lucene.Net version.

  7. Hi Itamar,

    I’m planning to run a knowledge base site and I want to use HebMorph.
    How can we discuss the details of commercial use?

    Thanks,

    Tomer

  8. Hi,

    Any plans to incorporate HSpell 1.2 into HebMorph?

    Thanks,
    Daniel

