Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, they don't necessarily provide the best results in terms of relevancy. In any case, no freely available solution can index Hebrew even at a very basic level.
HebMorph was started with this in mind. It is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. During the work on this project, we will try and come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is providing various IR libraries with the best Hebrew IR capabilities possible.
Apache Lucene has been selected as our planning and testing framework, thanks to its advanced capabilities, its flexibility, and the author's familiarity with it. During these initial steps, .NET code is being written and used with Lucene.Net (a .NET port of Java Lucene). Once the project is stable enough, ports to other languages will follow.
More detailed information on why this project is important can be found in a series of 3 blog posts: Challenges with indexing Hebrew texts (HebMorph, part 1), Finding Hebrew lemmas (HebMorph, part 2) and Open-source Hebrew information retrieval (HebMorph, part 3). The project's roadmap is in the last part.
The new HebMorph home, which is still being populated with content: http://hebmorph.code972.com
Live demo: http://hebmorph.code972.com
SourceForge home: https://sourceforge.net/projects/hebmorph/
Code repository: http://github.com/synhershko/HebMorph
Think-tank mailing list for discussion and planning: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
02/07/2010 - Exact and morphological searches can now be performed, both on a single field. A new flexible LemmaFilter mechanism has been introduced. Preliminary results look good.
13/07/2010 - Testing of hspell's language coverage gives promising results. Most unrecognized words are invalid; the rest should help perfect HebMorph's toleration mechanism or go into hspell's dictionary.
18/07/2010 - The Wikipedia browser and searcher application BzReader is now bundled with HebMorph, to provide better Hebrew searches - http://www.code972.com/blog/2010/07/wikipedia-offline-reader-supporting-hebrew/.
22/07/2010 - HebMorph was presented at SIGTRS. Slideshow and Hebrew project description are available.
27/07/2010 - A flowchart shedding some light on the inner working of HebMorph is available through the project wiki on github, at http://wiki.github.com/synhershko/HebMorph/hebmorph-flowchart
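To illustrate what the 02/07/2010 entry means by exact and morphological searches on a single field, here is a minimal sketch in Python. It is not HebMorph's code or API: the `$` marker for exact forms, the `TOY_LEMMAS` table, and all function names are assumptions made for illustration only. A real lemmatizer would consult the hspell dictionary instead of a hand-written table.

```python
from collections import defaultdict

# Assumption: exact (surface) forms are told apart from lemmas by a suffix marker.
EXACT_MARKER = "$"

# Hypothetical, hand-written lemma table standing in for a real dictionary lookup.
# A Hebrew word can be ambiguous, so one token may map to several lemmas.
TOY_LEMMAS = {
    "הילדים": ["ילד"],
    "ילדים": ["ילד"],
    "ילד": ["ילד"],
    "הרכבת": ["רכבת", "הרכבה"],
    "רכבת": ["רכבת"],
}


def analyze(token):
    """Return every term indexed for one token: the marked exact form plus its lemmas."""
    terms = [token + EXACT_MARKER]
    # Unknown words fall back to the token itself as their only lemma.
    terms.extend(TOY_LEMMAS.get(token, [token]))
    return terms


def build_index(docs):
    """Build a single-field inverted index mapping term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for term in analyze(token):
                index[term].add(doc_id)
    return index


def search_exact(index, word):
    """Match only documents containing the word exactly as typed."""
    return index.get(word + EXACT_MARKER, set())


def search_morph(index, word):
    """Match documents containing any word that shares a lemma with the query word."""
    hits = set()
    for lemma in TOY_LEMMAS.get(word, [word]):
        hits |= index.get(lemma, set())
    return hits


docs = {1: "הילדים בגן", 2: "ילד קטן", 3: "הרכבת הגיעה"}
index = build_index(docs)
```

In this sketch, `search_exact(index, "ילד")` matches only the document containing that exact word, while `search_morph(index, "ילדים")` also matches documents containing other inflections of the same lemma. Both query types run against the same index, which is the point of keeping exact forms and lemmas in one field.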
The tools provided in HebMorph can be used for a large variety of other things, including:
- A .NET Hebrew spellchecker
- OCR automatic corrections and probabilistic recognition, using the dictionary and optionally the tolerator functions.
- Basis for POS tagging software.
HebMorph is copyright (C) 2010-2011, Itamar Syn-Hershko.
Some parts of HebMorph are powered by hspell, copyright (C) 2000-2011, Nadav Har'El and Dan Kenigsberg (http://hspell.ivrix.org.il/).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License v3 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the AGPLv3. There is no warranty of any kind for the contents of this distribution.
The hspell dictionary files distributed with HebMorph are provided with the license to be used ONLY for search by HebMorph. To get an official hspell distribution under the GPLv2 license, visit their site.
If you are interested in using this product commercially or without releasing your source-code as AGPLv3, please contact the author.