HebMorph
Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, they don't necessarily provide the best results in terms of relevancy. More importantly, there is no freely available solution that can index Hebrew even at a very basic level.
HebMorph was started with this in mind. It is a free, open-source effort to make Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevancy in retrievals. During the work on this project, we will try to come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is to provide various IR libraries with the best Hebrew IR capabilities possible.
Apache Lucene has been selected as our planning and testing framework, thanks to its advanced capabilities, its flexibility, and the author's familiarity with it. During these initial steps, .NET code is being written and used with Lucene.Net (a .NET port of Java Lucene). Once the project stabilizes enough, ports to other languages will follow.
More detailed information on why this project is important can be found in a series of three blog posts: Challenges with indexing Hebrew texts (HebMorph, part 1), Finding Hebrew lemmas (HebMorph, part 2) and Open-source Hebrew information retrieval (HebMorph, part 3). The project's roadmap is in the last part.
The new HebMorph home, which is still being populated with content: http://hebmorph.code972.com
Resources
Live demo: http://hebmorph.code972.com
SourceForge home: https://sourceforge.net/projects/hebmorph/
Code repository: http://github.com/synhershko/HebMorph
Think-tank mailing list for discussion and planning: https://lists.sourceforge.net/lists/listinfo/hebmorph-thinktank
Updates
02/07/2010 - Exact and morphological searches can now be performed, both on a single field. A new flexible LemmaFilter mechanism has been introduced. Preliminary results look good.
13/07/2010 - Testing of hspell's language coverage provides promising results. Most unrecognized words are invalid; the rest should help perfect HebMorph's toleration mechanism or go into hspell's dictionary.
18/07/2010 - The Wikipedia browser and searcher application BzReader is now bundled with HebMorph, to provide better Hebrew searches - http://www.code972.com/blog/2010/07/wikipedia-offline-reader-supporting-hebrew/.
22/07/2010 - HebMorph was presented at SIGTRS. Slideshow and Hebrew project description are available.
27/07/2010 - A flowchart shedding some light on the inner workings of HebMorph is available through the project wiki on GitHub, at http://wiki.github.com/synhershko/HebMorph/hebmorph-flowchart
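One way to picture exact and morphological search on a single field (the 02/07/2010 update) is to index each token twice: once as its original surface form with a marker appended, and once per lemma. The sketch below only illustrates that idea under stated assumptions. It is not the actual LemmaFilter; the class name `SingleFieldTerms`, the `'$'` marker, and the map standing in for a lemmatizer are hypothetical, and English placeholder words are used instead of Hebrew.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not HebMorph's LemmaFilter: builds the terms to
// index for one token so exact and morphological queries share a field.
class SingleFieldTerms {
    // Assumed marker appended to the original surface form.
    static final char EXACT_MARKER = '$';

    // lemmas is a stand-in for a real lemmatizer: word -> its lemmas.
    static List<String> termsFor(String token, Map<String, List<String>> lemmas) {
        List<String> terms = new ArrayList<>();
        // The marked surface form is what an exact query would match.
        terms.add(token + EXACT_MARKER);
        // Unmarked lemmas are what a morphological query would match.
        terms.addAll(lemmas.getOrDefault(token, List.of()));
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<String>> lemmas = Map.of("books", List.of("book"));
        System.out.println(termsFor("books", lemmas));
    }
}
```

Under this scheme an exact query would be rewritten to search for the marked term, while a morphological query would lemmatize the query words and search for the unmarked terms, all against the same field.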
Other uses
The tools provided in HebMorph can be used for a large variety of other things, including:
- A .NET Hebrew spellchecker
- OCR automatic corrections and probabilistic recognition, using the dictionary and optionally the tolerator functions.
- A basis for POS tagging software.
License
HebMorph is copyright (C) 2010-2011, Itamar Syn-Hershko.
Some parts of HebMorph are powered by hspell, copyright (C) 2000-2011, Nadav Har'El and Dan Kenigsberg (http://hspell.ivrix.org.il/).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License v3 as published by the Free Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Note that not only the programs in the distribution, but also the dictionary files and the generated word lists, are licensed under the AGPLv3. There is no warranty of any kind for the contents of this distribution.
The hspell dictionary files distributed with HebMorph are provided with the license to be used ONLY for search by HebMorph. To get an official hspell distribution under the GPLv2 license, visit their site.
If you are interested in using this product commercially or without releasing your source-code as AGPLv3, please contact the author.
April 10th, 2011 - 18:29
I’m running a search engine that uses the package, and it works pretty smoothly with Lucene, but how can we integrate non-Hebrew words? (e.g. סינמה סיטי)
Great job so far!
April 10th, 2011 - 20:38
Thanks.
You are going to need to inject all words not in the hspell dictionary directly into the Radix we use internally, along with the correct morphological properties for each. This is not a trivial task, and it also requires versioning the index, so it is an important item on my TODO list for this project.
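To make the idea of adding a word together with its properties concrete, here is a minimal sketch of a dictionary keyed by word. It uses a plain character trie rather than a compressed radix tree, and the class name `WordTrie`, its methods, and the free-text properties string are all hypothetical. HebMorph's actual Radix and morphological data types are not shown here and will differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a plain character trie mapping words to a
// free-text description. This is not HebMorph's Radix class.
class WordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String properties; // null when no word ends at this node
    }

    private final Node root = new Node();

    // Adds a word with its properties, overwriting any previous entry.
    void add(String word, String properties) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.properties = properties;
    }

    // Returns the stored properties, or null when the word is unknown.
    String lookup(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.get(word.charAt(i));
            if (node == null) {
                return null;
            }
        }
        return node.properties;
    }

    public static void main(String[] args) {
        WordTrie dict = new WordTrie();
        // The Hebrew transliteration of "cinema", written with Unicode escapes.
        String cinema = "\u05E1\u05D9\u05E0\u05DE\u05D4";
        dict.add(cinema, "noun, loanword");
        System.out.println(dict.lookup(cinema));
        // A strict prefix of an added word is not itself a word.
        System.out.println(dict.lookup(cinema.substring(0, 2)));
    }
}
```

A real radix tree would merge chains of single-child nodes into one edge to save memory, and the properties would be structured data rather than a string, but the add and lookup flow is the same.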
April 25th, 2011 - 21:50
Do you intend to extend the project to entity extraction as well?
April 26th, 2011 - 02:37
HebMorph focuses on developing a proper Hebrew search. Entity extraction uses language-independent techniques, so in a complete product all I’ll have to do for entity extraction is grab an implementation from somewhere else…
September 6th, 2011 - 17:30
More about entity extraction: can you estimate when it will be ready?
September 7th, 2011 - 01:09
Entity extraction is out of scope for this project, but if you need help configuring HebMorph with EX technologies (some good OSS exist, too), feel free to contact me directly.
September 7th, 2011 - 13:26
Hi Itamar.
Is there an example of using the HebMorph with Lucene (wrapping it under an Analyzer)?
Best Regards
Elad Ash
September 7th, 2011 - 17:18
Several. There is one in the project itself, and the sources for the sample applications are posted on GitHub:
https://github.com/synhershko/HebMorph.CorpusSearcher
https://github.com/synhershko/HebMorph.CorpusReaders
https://github.com/synhershko/BzReader
September 7th, 2011 - 17:19
HebMorph has an accompanying project, Lucene.Analysis.Hebrew, where HebMorph is wrapped with a Lucene analyzer, both in Java and .NET.
September 7th, 2011 - 13:30
Let me rephrase, is there an analyzer under the Lucene.Net…
February 6th, 2012 - 17:04
Hi,
I just downloaded the new 2.9.4 lucene.net, and now my project fails to build because lucene.net.analysis is missing the 2.9.2 reference.
Are you planning to make a fix, or do I need to download the src and compile the solution again?
February 6th, 2012 - 17:26
Updating the project references and compiling the sources against those binaries would do. I will not update it myself since not everybody is using the latest Lucene.NET. I will probably move to using nuget though, it will make it easier to upgrade.
Out of curiosity, how do you use HebMorph? Can you please email me with the details?
February 7th, 2012 - 16:16
We are using it at cet.
Talk to Oren Eini; he was here consulting us on RavenDB and lucene.net.
Also feel free to email me and I will tell you more about how we are using HebMorph with Lucene.net.
Anyway, I will download the src and compile it against the new lucene.net version.
February 9th, 2012 - 15:40
Hi Itamar,
I’m planning to run a knowledge base site and I want to use HebMorph.
How can we discuss the details of commercial use?
Thanks,
Tomer
February 9th, 2012 - 16:05
Ping me by email: itamar at this domain
April 30th, 2012 - 15:39
Hi,
Any plans on incorporating HSpell 1.2 into hebmorph?
Thanks,
Daniel
April 30th, 2012 - 15:54
Definitely, it should be out in a few weeks’ time.
May 4th, 2012 - 16:19
It is now done
November 27th, 2012 - 20:18
I can’t get this to work. There is no .project or .classpath file for importing the project into Eclipse. When importing it via “Import FileSystem” I get lots of errors. It seems that org.springframework.context.support.ApplicationObjectSupport is missing. Is it a build path problem? Can we get a step-by-step tutorial on how to get this thing to work?
November 27th, 2012 - 23:42
There is no dependency on Spring. It is easiest to open it with IntelliJ.