Wikipedia offline reader with Hebrew search support

July 18th, 2010 English posts, HebMorph, Lucene.Net


BzReader (http://code.google.com/p/bzreader/) is a simple utility for browsing Wikipedia dump files offline. Once a dump has been downloaded, BzReader goes through all the pages and articles in it and indexes their titles. This makes it easy to search Wikipedia for a specific topic and, once the topic is found, to read it directly from the application. At the moment only the titles are indexed, not the actual page contents. I went ahead and forked the project so I could add extra functionality more easily. For now I have just updated the original code base to work with Lucene.Net 2.9.2 (the latest release, replacing a very old version), and added better search support for Hebrew dumps with the help of HebMorph's Lucene.Net integration (see: code972.com/blog/hebmorph).
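To make the title-only indexing idea concrete, here is a rough Python sketch. It is not BzReader's code (BzReader is a .NET application built on Lucene.Net); the function names `iter_titles`, `build_title_index` and `search`, and the tiny `SAMPLE` dump, are made up for illustration. It assumes the dump is a bz2-compressed MediaWiki XML export in which every page has a `<title>` element.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def iter_titles(stream):
    """Yield page titles from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags carry a namespace prefix such as "{...}title"; strip it.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "title" and elem.text:
            yield elem.text
        elif tag == "page":
            elem.clear()  # drop the finished page to keep memory use low


def build_title_index(titles):
    """Map each lowercased title word to the set of titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for token in re.findall(r"\w+", title.lower()):
            index[token].add(title)
    return index


def search(index, query):
    """Return titles that contain every word of the query (exact words only)."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical three-page dump, used only so the sketch runs on its own.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Blood test</title><revision><text>...</text></revision></page>
  <page><title>Software testing</title><revision><text>...</text></revision></page>
  <page><title>Test</title><revision><text>...</text></revision></page>
</mediawiki>"""

# Compress and reopen the sample to mimic reading a .xml.bz2 dump file.
dump = bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb")
index = build_title_index(iter_titles(dump))
print(sorted(search(index, "test")))
```

Searching for "test" here finds "Blood test" and "Test" but not "Software testing", because the index only knows the exact words that appear in the titles.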

The updated code can be found here: http://github.com/synhershko/BzReader. Read the instructions there before compiling.

Here's a screenshot demonstrating how much Hebrew searches improved after plugging HebMorph in. The search was for the Hebrew word for "test" (the noun). With StandardAnalyzer, only exact matches were found. When indexed and searched with HebMorph, construct forms and plurals of the word were found as well, for example "blood test" and "software tests":

[Figure: Comparing Hebrew searches using StandardAnalyzer and HebMorph, via BzReader]
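The difference between the two analyzers can be illustrated with a toy Python sketch. This is not HebMorph's API or algorithm: the `TOY_LEMMAS` table is a hand-written, hypothetical stand-in for a real morphological analyzer, and English words stand in for the Hebrew ones. The only point is that mapping inflected forms to a shared base form at both index time and query time lets one query match several surface forms.

```python
import re
from collections import defaultdict

# Hypothetical lemma table standing in for a real morphological analyzer.
TOY_LEMMAS = {"tests": "test", "testing": "test"}


def exact_tokens(text):
    """Lowercased words as written, roughly what exact matching sees."""
    return re.findall(r"\w+", text.lower())


def lemma_tokens(text):
    """Words reduced to a base form using the toy lemma table."""
    return [TOY_LEMMAS.get(token, token) for token in exact_tokens(text)]


def build_index(titles, analyze):
    """Index titles under the tokens produced by the given analyzer."""
    index = defaultdict(set)
    for title in titles:
        for token in analyze(title):
            index[token].add(title)
    return index


def search(index, query, analyze):
    """Analyze the query the same way as the titles and intersect the hits."""
    hits = [index.get(token, set()) for token in analyze(query)]
    return set.intersection(*hits) if hits else set()


TITLES = ["Test", "Blood test", "Software tests"]
exact_index = build_index(TITLES, exact_tokens)
lemma_index = build_index(TITLES, lemma_tokens)

print(sorted(search(exact_index, "test", exact_tokens)))
print(sorted(search(lemma_index, "test", lemma_tokens)))
```

With exact tokens the plural title is missed; with the lemma-based tokens all three titles are returned.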
