HebMorph at SIGTRS 07/10
Today I gave a talk at SIGTRS on Hebrew search and HebMorph. The slideshow from the presentation is attached to this post. More info on HebMorph is available on the project's page.
A PDF with the presentation summary in Hebrew is available as well (6 pages): HebMorph SIGTRS presentation summary. It describes what exactly HebMorph is, what problems it tries to solve, and how.
Twitter is using BitTorrent internally for faster deployment
Twitter revealed this in a video that has been circulating lately. Instead of issuing thousands of git pulls across their deployment servers, Twitter started using the BitTorrent protocol from Python to distribute the binaries in their deployment cycle. They report a drastic speed improvement.
They call this "Murder", and the code is freely available at http://github.com/lg/murder.
Here's Larry Gadea describing Murder:
Wikipedia offline reader with Hebrew search support
BzReader (http://code.google.com/p/bzreader/) is a simple utility for browsing dump files downloaded from Wikipedia. Once a dump is downloaded, BzReader goes through all the pages and articles in it and indexes their titles. With BzReader it is easy to browse and search Wikipedia for specific topics, and once a topic is found, to read it directly in the application. At the moment only the titles are indexed, not the actual page contents.
I went ahead and forked the project so I could add extra functionality more easily. For now I have just updated the original code base to work with Lucene.Net 2.9.2 (the latest, instead of a very old version), and added better search support for Hebrew dumps with the help of HebMorph's Lucene.Net integration (see: code972.com/blog/hebmorph).
The updated code can be found here: http://github.com/synhershko/BzReader. Read the instructions there before compiling.
Here's a screenshot demonstrating how much Hebrew searches improved after plugging HebMorph in. The search was for the Hebrew word for "test" (noun). With StandardAnalyzer, only exact matches were found. When indexed and searched with HebMorph, construct forms and plurals of the word were found as well, for example "blood test" and "software tests":
Testing hspell’s language coverage using Wikipedia
As part of the HebMorph project, I needed to test hspell's dictionary on a large modern corpus. Knowing how many words it can recognize is very important, and below I explain exactly why.
The project, along with usage instructions, is released under the GNU GPL and available from here. The report (zipped XML) is available here.
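The kind of coverage measurement described above can be sketched roughly as follows. This is a minimal Python sketch, not the released tool: the `coverage` function and the `is_known` callback are hypothetical names, and a real run would back `is_known` with hspell's dictionary and feed it text extracted from a Wikipedia dump.

```python
import re
from collections import Counter

# Runs of Hebrew letters (U+05D0 alef .. U+05EA tav); everything else is a separator.
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")


def coverage(text, is_known):
    """Measure how much of `text` a dictionary recognizes.

    `is_known` is a hypothetical callback: word -> bool.
    Returns (token_coverage, type_coverage, unknown_words), where
    unknown_words is a Counter of unrecognized words and their frequencies.
    """
    counts = Counter(HEBREW_WORD.findall(text))
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        return 0.0, 0.0, Counter()

    unknown = Counter({w: c for w, c in counts.items() if not is_known(w)})
    known_tokens = total_tokens - sum(unknown.values())
    known_types = len(counts) - len(unknown)
    return known_tokens / total_tokens, known_types / len(counts), unknown


if __name__ == "__main__":
    # Toy stand-in for a real dictionary lookup (assumption).
    toy_dictionary = {"שלום", "עולם"}
    tokens, types, unknown = coverage("שלום עולם שלום", toy_dictionary.__contains__)
    print(tokens, types, unknown.most_common(5))
```

Reporting both token and type coverage is a deliberate choice in this sketch: a few very frequent unknown words hurt search quality far more than many rare ones, and the frequency-sorted list of unknown words shows where a dictionary would benefit most from additions.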
More flexible Hebrew indexing with HebMorph
In the past week I've been working on making Hebrew indexing with HebMorph more flexible. It is now possible to perform different types of searches, and also to control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done.
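One way to picture exact and morphological search sharing a single field is the toy Python sketch below. It is not HebMorph's API: the `$` marker, the lemma table (transliterated Hebrew), and every function name are assumptions made for illustration. The idea is that each word emits its surface form tagged with a marker plus its lemmas, all at the same position, so one index serves both query types.

```python
from collections import defaultdict

EXACT_MARKER = "$"  # assumption: suffix that tags the unmodified surface form

# Hypothetical toy lemma table, transliterated: test, tests, test-of (construct).
TOY_LEMMAS = {
    "bdika": ["bdika"],
    "bdikot": ["bdika"],
    "bdikat": ["bdika"],
}


def lemmatize(word):
    """Return the lemmas of `word`; unknown words fall back to themselves."""
    return TOY_LEMMAS.get(word, [word])


def analyze(text):
    """Yield (position, term) pairs; exact form and lemmas share a position."""
    for pos, word in enumerate(text.lower().split()):
        yield pos, word + EXACT_MARKER  # exact-match term
        for lemma in lemmatize(word):   # morphological terms
            yield pos, lemma


def build_index(docs):
    """Build a single inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _pos, term in analyze(text):
            index[term].add(doc_id)
    return index


def search_exact(index, word):
    """Match only documents containing the word exactly as typed."""
    return set(index.get(word.lower() + EXACT_MARKER, set()))


def search_morph(index, word):
    """Match documents containing any inflection sharing a lemma with `word`."""
    hits = set()
    for lemma in lemmatize(word.lower()):
        hits |= index.get(lemma, set())
    return hits


if __name__ == "__main__":
    docs = {1: "bdika", 2: "bdikat dam", 3: "bdikot tochna"}
    idx = build_index(docs)
    print(search_exact(idx, "bdika"), search_morph(idx, "bdika"))
```

Because marked terms can never collide with lemmas, the contents are stored once and the query side decides which kind of term to look up. In a real engine, keeping both kinds of term at the same position would also keep phrase queries meaningful.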
