Code972 Coding from the back of a camel

23 Jul 2010

HebMorph at SIGTRS 07/10

Today I gave a talk at SIGTRS on Hebrew search and HebMorph. The slideshow from the presentation is attached to this post. More info on HebMorph is available on the project's page.

A PDF with the presentation summary in Hebrew is available as well (6 pages): HebMorph SIGTRS presentation summary. It describes exactly what HebMorph is, what problems it tries to solve, and how.

19 Jul 2010

Twitter is using BitTorrent internally for faster deployment

This is what they revealed in a video that has been floating around lately. Instead of sending thousands of git pull requests to each of their deployment servers, Twitter started using the BitTorrent protocol from Python to distribute the binaries in their deployment cycle. They report a drastic speed improvement.

They call this "Murder", and the code is freely available at http://github.com/lg/murder.

Here's Larry Gadea describing Murder:

18 Jul 2010

Wikipedia offline reader with Hebrew search support

BzReader (http://code.google.com/p/bzreader/) is a simple utility for browsing dump files downloaded from Wikipedia. Once a dump is downloaded, BzReader goes through all pages and articles in it and indexes their titles. With BzReader it is easy to browse and search Wikipedia for specific topics and, once a topic is found, to read it directly in the application. At the moment only the page titles are indexed, not the actual page contents.

I went ahead and forked the project so I could add extra functionality more easily. For now I just updated the original code base to work with Lucene.Net 2.9.2 (the latest, instead of a very old version of it), and added better search support for Hebrew dumps with the help of HebMorph's Lucene.Net integration (see: code972.com/blog/hebmorph).

The updated code can be found here: http://github.com/synhershko/BzReader. Read the instructions there before compiling.

Here's a screenshot demonstrating how drastically Hebrew searches improved after plugging HebMorph in. The search was for the Hebrew word for "test" (noun). With StandardAnalyzer, only exact matches were found. When indexed and searched with HebMorph, construct forms and plurals of the word were found as well, for example "blood test" and "software tests":

Comparing Hebrew searches using StandardAnalyzer and HebMorph, via BzReader
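To make the difference in the screenshot concrete, here is a minimal sketch of exact-token matching versus lemma-based matching. It is not HebMorph or Lucene.Net code: the `LEMMAS` table is a hypothetical hand-written stand-in for a morphological dictionary, and the transliterated forms (`bdika`, `bdikat`, `bdikot`) are my own illustration of a base form, a construct form and a plural.

```python
# Hypothetical lemma table: surface form -> lemma (not HebMorph's dictionary).
LEMMAS = {
    "bdika": "bdika",    # "test", base form
    "bdikat": "bdika",   # construct form, as in "blood test"
    "bdikot": "bdika",   # plural, as in "software tests"
}


def tokenize(text):
    """Lowercase and split on whitespace; a real analyzer does far more."""
    return text.lower().split()


def exact_match(query, title):
    """Roughly what exact term matching gives: the token must be identical."""
    return query.lower() in tokenize(title)


def lemma_match(query, title):
    """Match when the query and any title token share a lemma."""
    wanted = LEMMAS.get(query.lower(), query.lower())
    return any(LEMMAS.get(tok, tok) == wanted for tok in tokenize(title))


titles = ["bdika", "bdikat dam", "bdikot tochna", "dam"]
exact_hits = [t for t in titles if exact_match("bdika", t)]
lemma_hits = [t for t in titles if lemma_match("bdika", t)]

print(exact_hits)  # only the title containing the exact form
print(lemma_hits)  # also the construct and plural forms
```

The point of the sketch is only that inflected forms never equal the query string, so they are invisible to exact matching and need some mapping back to a shared lemma.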

13 Jul 2010

Testing hspell’s language coverage using Wikipedia

As part of the HebMorph project, I needed to test hspell's dictionary on a large modern corpus. Knowing how many words it can recognize is very important, and below I'll be explaining exactly why.

The project, along with usage instructions, is released under the GNU GPL and available from here. The report (zipped XML) is available here.
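As a rough idea of what such a coverage test measures, here is a small sketch. It is not the released tool and does not call hspell: `dictionary` is a plain set standing in for a real dictionary lookup, and the `coverage` function and sample tokens are hypothetical.

```python
from collections import Counter


def coverage(tokens, dictionary):
    """Return (share of token occurrences recognized, unknown words by frequency).

    `dictionary` is any container supporting `in`; here a plain set stands in
    for a real dictionary lookup.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    known = sum(n for word, n in counts.items() if word in dictionary)
    unknown = Counter({w: n for w, n in counts.items() if w not in dictionary})
    ratio = known / total if total else 0.0
    return ratio, unknown.most_common()


# Toy corpus: four token occurrences, three of them recognized.
sample_tokens = ["a", "b", "a", "c"]
ratio, unknown_words = coverage(sample_tokens, {"a", "b"})
print(ratio, unknown_words)
```

Listing the unknown words by frequency is a design choice for the sketch: the most common unrecognized words are the ones whose absence hurts search the most.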

2 Jul 2010

More flexible Hebrew indexing with HebMorph

In the past week I've been working on making Hebrew indexing with HebMorph more flexible. It is now possible to perform different types of searches, and also to control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done.
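One way exact and morphological searches can share a single field is to store both the original token (with a marker) and its lemmas in the same postings. The sketch below illustrates that idea only; it is an assumption for illustration, not HebMorph's actual implementation. The `EXACT_MARK` suffix, the `LEMMAS` table and all function names are hypothetical.

```python
from collections import defaultdict

EXACT_MARK = "$"  # hypothetical suffix marking the original, unanalyzed token

# Hypothetical lemma table standing in for a morphological analyzer.
LEMMAS = {"test": ["test"], "tests": ["test"]}


def index_terms(token):
    """Emit the marked original token plus its lemmas, all for one field."""
    return [token + EXACT_MARK] + LEMMAS.get(token, [token])


def build_index(docs):
    """Build one inverted index (term -> doc ids) holding both kinds of terms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for term in index_terms(token):
                index[term].add(doc_id)
    return index


def search(index, word, exact=False):
    """Exact search looks up the marked term; morphological search the lemmas."""
    word = word.lower()
    if exact:
        return sorted(index.get(word + EXACT_MARK, set()))
    hits = set()
    for lemma in LEMMAS.get(word, [word]):
        hits |= index.get(lemma, set())
    return sorted(hits)


docs = {1: "software tests", 2: "blood test"}
index = build_index(docs)
print(search(index, "test", exact=True))  # documents with the exact form
print(search(index, "test"))              # documents sharing the lemma
```

Because both kinds of terms live in one index, the text is analyzed once and the choice between exact and morphological matching is made at query time.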