Hebrew search at the National Library
"All of the National Library's collections, now online", the headlines cried. As a lover of texts, I went to see what it was all about.
The library's website (http://web.nli.org.il) gives access to the catalog and to various archives, with a free-text search box at the top of the site. Naturally, that was the first thing I tried on the site...
Well, it seems the Hebrew search problem was indeed known and taken into account when the site was built. Some attention appears to have been given to morphological processing of some kind, but unfortunately the results are far from good, or even correct.
A few representative examples, each with a brief conclusion:
- A search for "רבין" (Rabin) brings up completely irrelevant results in the first 6 hits (with the word "רביניו" highlighted). An audio recording by עוזר רבין appears seventh, the first of the results that actually match "רבין". That is particularly poor precision at the top of the ranking. The cause is that exact forms and forms merely suspected of being similar are given the same weight, and note that this is a word with very few possible inflections.
- The prefix letters (מש"ה וכל"ב) are not handled properly at all - a search for "הלב" does not return results containing the word "לב"; only inflections of "לב" carrying the ה prefix are retrieved. This is not the right way to do it - we want to rank exact matches higher, but not to lose relevant matches that were originally written without the prefix letters.
- Gershayim (the double-quote mark inside acronyms). Not supported. At all. Searches for צה"ל, רמב"ם, רמב"ן return no results (but צהל, רמבם do).
- Full / defective spelling (ktiv male / ktiv haser) - not supported at all. Searches for אמא / אימא, חנוכיה / חנוכייה, ספריה / ספרייה and others return completely different results.
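The behaviour asked for in the list above (keep the exact form on top, but do not lose forms written without the prefix letters or without gershayim) can be illustrated with a minimal Python sketch. This is not how the library's engine or HebMorph works; the function names, the 0.5 penalty and the two-letter limits are assumptions made up for the illustration.

```python
import re

# The single-letter prefixes: mem, shin, he, vav, kaf, lamed, bet.
PREFIX_LETTERS = "משהוכלב"

# ASCII double quote, Hebrew gershayim (U+05F4) and typographic double quotes.
_ACRONYM_MARKS = re.compile('["\u05F4\u201C\u201D]')


def normalize(term):
    """Drop acronym marks so that the quoted and unquoted spellings compare equal."""
    return _ACRONYM_MARKS.sub("", term)


def expand(term, max_prefix=2, min_stem=2):
    """Map each candidate form to a weight; the exact form always weighs most."""
    term = normalize(term)
    candidates = {term: 1.0}
    stem, weight = term, 1.0
    for _ in range(max_prefix):
        if len(stem) <= min_stem or stem[0] not in PREFIX_LETTERS:
            break
        stem = stem[1:]
        weight *= 0.5  # arbitrary penalty per stripped letter (an assumption)
        candidates.setdefault(stem, weight)
    return candidates


def rank(query, documents):
    """Score each document by its best-matching form, exact matches first."""
    forms = expand(query)
    scored = []
    for doc in documents:
        tokens = {normalize(token) for token in doc.split()}
        score = max((w for form, w in forms.items() if form in tokens), default=0.0)
        if score > 0:
            scored.append((score, doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With this sketch, a query for "הלב" still retrieves a document that only contains "לב", just with a lower score than a document containing "הלב" itself. Blind prefix stripping will also cut letters that belong to the word, which is why the stripped forms are weighted down rather than treated as equal; full / defective spelling is not addressed here at all.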
All of the examples above lead me to believe this is query expansion of some kind, and in any case it is clear that this is a very lightweight search engine for the national book collection. The search is not exhaustive, and its precision and recall are very low. In several talks I gave on the subject I have already shown similar examples from sites such as Ynet, the Hebrew Wikipedia and Tapuz, but from the National Library of all places I expected more...
The HebMorph project, which you can read a lot about on this site as well, was created for exactly this purpose, and it is open source (with an option for commercial use). A short session with the live demo shows that the engine already handles the points mentioned above...
Orev: The Apache OpenRelevance Viewer
It has been quite some time since I said I'd be working on this; I got caught up in other pressing matters and had to drop it for a while. But it is all for the best. The technology I used for this new version is a perfect fit for this application, and it wasn't available then. I'll be addressing the technical aspects later in this post and also in some follow-up posts.
My first interest in the OpenRelevance project, and one of the main reasons I created Orev, was the HebMorph project. Using Orev, I'm hoping to be able to create an environment where tools for Hebrew IR can be tested and compared, to produce the ultimate Hebrew analyzer, for Lucene and other libraries as well.
Before anything else, the complete source code is available at https://github.com/synhershko/Orev.
I have a hosted version too, and I will publish a link to it soon, once I get some things sorted out and receive some feedback from other people who were involved in this project.
What is this?
The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.
These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.
Since no such tool existed, the Viewer - Orev - is meant to be exactly that, and so to minimize the overhead required from both the project managers and the people who will be doing the actual work. By providing easy facilities for adding new Topics and Corpora, and for feeding documents into a corpus, it makes the surrounding infrastructure easy to manage. And with a friendly web UI for judging documents, the work of the recruits should be very easy to grok.
More technical details
Orev is multi-lingual from the ground up, and is heavily user-based. Every user can view the available topics and corpora, and make judgments based on the languages they speak.
Managers can add new topics, create new corpora and feed those with documents. Documents can be added to a corpus, or updated, at a later time, too.
We will probably also let regular users submit topics, among other things.
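To make the moving parts concrete, here is a rough Python sketch of the entities described above. Orev itself is an ASP.NET MVC application, so none of these class or field names come from its code; they are assumptions used only to illustrate how users, languages, topics, corpora and judgments relate.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    languages: set          # language codes the user can judge in
    is_manager: bool = False


@dataclass
class Topic:
    title: str
    language: str


@dataclass
class Corpus:
    name: str
    language: str
    documents: dict = field(default_factory=dict)  # document id -> text

    def feed(self, doc_id, text):
        """Add a new document, or update an existing one, at any later time."""
        self.documents[doc_id] = text


@dataclass(frozen=True)
class Judgment:
    user: str
    topic: str
    doc_id: str
    relevant: bool


def visible_to(user, items):
    """Topics or corpora the user can judge, based on the languages they speak."""
    return [item for item in items if item.language in user.languages]
```

A manager would create `Topic` and `Corpus` objects and call `feed`, while a regular user would only see what `visible_to` returns and produce `Judgment` records.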
Even more technical details
When I started to work on this I was using NHibernate and spent some time on designing a DB schema, fighting with ASP.NET MVC and all that. Now that MVC 3 is out, and RavenDB is rocking worlds, it was a matter of a few hours to get this all started again from scratch. Using a schema-less DB really made this possible to do in a minimum number of hours, excluding some dilemmas and frustrations which I will be blogging about soon.
In the original design I intended to load corpus documents from external sources, or to store them on the file system. Since it is now using RavenDB, which is a document database, storing the documents in the DB itself actually makes sense. This is also how we can later offer updating a corpus with new documents, or patching old documents.
What's next
We need to run a lot of tests, get a lot of feedback and improve accordingly. The first step is obviously gathering content and raising interest, so if you find this post / project interesting - please spread the word.
Orev is currently using the default ASP.NET MVC theme. If there's any HTML5/CSS designer and magic worker who can take up the task to recreate it to be more inviting and easier to work with - it is something we can definitely use.
I have enabled the github bug tracker in the Orev source repository. Please use it for reporting bugs or asking for features.
When the dust settles and actual judging commences on a regular basis, we will start working on code to output stats and statistical computations, in preparation for the original purpose of the OpenRelevance project - to measure the performance of IR software (+ NLP + ML, of course), and to be able to produce bleeding-edge analyzers for various languages.
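As a taste of what such computations look like, here is a minimal Python sketch of precision and recall for a single topic, computed from a ranked result list and the set of documents judged relevant. It is not Orev code; the function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant, k=None):
    """Precision and recall of a ranked list of unique ids, optionally cut at rank k."""
    top = retrieved if k is None else retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments for one topic, and the ranked output of one analyzer.
judged_relevant = {"d1", "d2", "d5"}
ranked_results = ["d3", "d1", "d7", "d2"]

print(precision_recall(ranked_results, judged_relevant))
print(precision_recall(ranked_results, judged_relevant, k=2))
```

Running the same computation over the output of two analyzers, against the same judgments, is the kind of comparison the project is aiming for.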
Practical Hebrew search – Open2011 presentation
Attached to this post is the presentation I gave today at Open2011 in Tel-Aviv.
The sample app can be found here: http://hebmorph.code972.com/. It is also going to be HebMorph's home in a few weeks, once I'm done generating all the necessary content.
As promised, I will be posting more details on some interesting findings on Hebrew search, and comparisons with Google search. I want those posts to be a bit more comprehensive, so they will be up in a few weeks' time.
Some words on HebMorph’s licensing
Without being a lawyer, and trying real hard not to become one, it is not easy to be an author of an open-source project. Apparently it takes quite a lot of thought, and definitely a lot of reading, to make sure the code you release has an appropriate license that specifies your intent correctly. If you don't pay enough attention, you will probably end up with a license that does not enforce what you intended at all.
This is what happened to me with HebMorph, and this post is here to clarify everything that needs clarifying, and to explain the reasoning behind the recent license change to HebMorph.
Like I said in an e-mail conversation we recently held on HebMorph's mailing list, this project is all about research and sharing of information. We WILL reach our goals, some sooner than others, and when we do, the knowledge we gathered will be free for all to learn from and use. However, since we have a very long road ahead of us, I needed to make sure this project can support itself. I spent a lot of time researching options, charting a path, writing code, testing approaches and a lot more, and to be able to continue doing that in large blocks of time (and not just occasionally) we needed income.
This is when I decided to charge for any commercial use of code released by the HebMorph project. It is actually pretty simple and very fair: I release my work for all to see and use without any charge. If, however, you make a profit from my work, I'd like you to support the project. Since we are aiming at quite a small market, relying on donations won't cut it, so I decided to use a license which will allow me to enforce that.
I explicitly stated more than once, and in more than one place, that I'm not after anyone's money. This project grew out of sheer interest, and it will definitely continue to evolve. This is why HebMorph doesn't have a price tag; if you want to use it in a commercial product, contact me and we'll figure something out. An arrangement that is fair for both parties.
Unaware of many legal details, I chose GPLv2 to be HebMorph's license. It seemed promising: any derivative work would require the consuming application to be released under GPLv2 as well, and since most companies would like to avoid that - they would pay for a commercial license. It also was the same license hspell is using, and since some parts of HebMorph are definitely a derivative work of hspell, it required HebMorph to be released under a compatible license, or GPLv2 itself. Problem solved - or at least so I thought.
Following a recent user inquiry, I found out my license of choice was in fact not suitable at all. First, it has many flaws and loopholes making it quite ineffective in enforcing what I wanted it to. It is practically the last license I would choose for any modern software; here's a good read on why.
Secondly, and no less important, any GPLv2 software is incompatible with Lucene/Solr, which is released under the Apache Software License. Since our main platform is Lucene, we can't afford that.
Now that I have realized all this, I've changed HebMorph's license to AGPLv3. This license is based on GPLv3 (an improvement over GPLv2 in itself), but adds a paragraph that defines "use" in a way that also covers websites and web services, and by that seals off the infamous GPLv2 loophole. Since AGPLv3 isn't compatible with GPLv2, I had to get explicit permission from hspell's authors to still be able to use it, and they granted it - with the restriction that the hspell files distributed with HebMorph may be used only for search purposes.
Now, you may notice how I frequently used the word "fair" when describing the license selection process. This is because I'm not here to run and seal loopholes, or make sure anyone that is making profit from my work is paying back. I enjoy doing other things, not that. I expect users to be fair; if they make profit from a product that uses HebMorph in one way or another, I expect them to be fair and give back. There probably could be thousands of ways to bypass any license, AGPL included, so I'm making it clear that I release HebMorph under the AGPL and also under the expectation of fairness. At some point I was actually considering using RPL, but then I decided it is too restrictive and will probably make more problems than it will solve. So I selected AGPLv3, and let me say this again: please act in good faith.
And just to make sure: as far as I'm concerned, using any HebMorph code through Solr is just the same as using it through Lucene. Solr is dynamically linking the jars in what falls under the very definition of "derivative work", and in case that was in doubt, it isn't now. I'm explicitly specifying this, so even if there is a loophole here (which I'm quite certain there is not), it is now under the license definition of "use": if your application uses Solr, and Solr uses HebMorph, your application is effectively using AGPLv3 software and needs to be AGPLv3 as well.
Hopefully this clarifies some things about HebMorph, and as always I'd love to hear any thoughts on this.
Due to the unintended conflict of licenses, any previous versions of HebMorph being used with Lucene/Solr have to move to the new license.
As before, OSS projects and non-profit closed source projects are welcome to use HebMorph with no charge, but the latter should contact me in advance to discuss some terms.
So, what have I been up to lately?
To all of those who asked: NAppUpdate and HebMorph aren't dead.
The last few months have been quite hectic for me in several respects, and this is mainly why I wasn't able to make any real progress with them, not to mention blogging. I still have big plans for both, and several other unrelated plans too. Things are becoming more relaxed now, so hopefully I can find time to work on all the ideas I have running in my head...
I recently joined Ayende's Hibernating Rhinos, and am currently working on RavenDB. So, it is inevitable that some NoSQL/RavenDB ideas and posts will pop up.
NAppUpdate is working very well for simple uses, which is exactly what I was intending to have when I first started working on it. However, judging by the stats I see, there is quite a bit of interest in the tool, and in the features it is yet to offer. So I'm definitely planning on enhancing and improving it as time allows - feel free to jump in if you want to help. Work items I have planned include stabilizing the API and NauXML format, task groups, logging and reporting, UI and better handling of dependencies in task execution.
As for HebMorph, well, that is a project I'm deeply in love with, and as such it will be the first to get my attention. I'm getting a lot of positive feedback, but there's still so much work left to do - even though what we have now is completely functional. I'll be giving a short talk on HebMorph and Hebrew search at Penguin Israel 2011 (location and date TBD), and until then I'm really hoping to get a few surprises ready. Stay tuned, I'll be blogging about them as I go.
HebMorph at SIGTRS 07/10
Today I gave a talk at SIGTRS on Hebrew search and HebMorph. Attached to this post is the slideshow from the presentation. More info on HebMorph is accessible through the project's page.
A PDF with the presentation summary in Hebrew is available as well (6 pages): HebMorph SIGTRS presentation summary. It describes what exactly HebMorph is, what problems it tries to solve, and how.
Wikipedia offline reader with Hebrew search support
BzReader (http://code.google.com/p/bzreader/) is a simple utility which allows browsing dump files downloaded from Wikipedia. Once a dump is downloaded, BzReader goes through all pages and articles in the dump file and indexes their titles. Using BzReader, it is easy to browse and search Wikipedia for specific topics, and once a topic is found, to read it directly from the application. At the moment, the actual page contents aren't being indexed, only their titles.
I went ahead and forked the project, so I could add some extra functionality more easily. For now I just updated the original code base to work with Lucene.Net 2.9.2 (the latest, instead of a very old version of it), and added better search support for Hebrew dumps with the help of HebMorph's Lucene.Net integration (see: code972.com/blog/hebmorph).
The updated code can be found here: http://github.com/synhershko/BzReader. Read the instructions there before compiling.
Here's a screenshot demonstrating how Hebrew searches were drastically improved after plugging HebMorph in. The search was for the Hebrew word "test" (noun). When used with StandardAnalyzer, only exact matches were found. When indexed and searched with HebMorph, construct forms and plurals of the word were found as well, for example "blood test" and "software tests":
Testing hspell’s language coverage using Wikipedia
As part of the HebMorph project, I needed to test hspell's dictionary on a large modern corpus. Knowing how many words it can recognize is very important, and below I'll be explaining exactly why.
The project, along with usage instructions, is released under the GNU GPL and available from here. The report (zipped XML) is available here.
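The basic measurement behind such a test can be sketched in a few lines of Python. This is not the released project: a plain set of words stands in for hspell's dictionary, the tokenizer simply takes runs of Hebrew letters, and all names are assumptions made for the illustration.

```python
import re
from collections import Counter

# Runs of Hebrew letters (alef to tav); niqqud and punctuation are ignored.
HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")


def coverage(text, known_words):
    """Return the share of recognized tokens and the unknown words by frequency."""
    counts = Counter(HEBREW_WORD.findall(text))
    total = sum(counts.values())
    known = sum(n for word, n in counts.items() if word in known_words)
    unknown = Counter({w: n for w, n in counts.items() if w not in known_words})
    ratio = known / total if total else 0.0
    return ratio, unknown.most_common()
```

Sorting the unknown words by frequency is the useful part in practice: it shows which missing words would hurt retrieval the most.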
More flexible Hebrew indexing with HebMorph
In the past week I've been working on making Hebrew indexing with HebMorph more flexible. It is now possible to perform different types of searches, and also to control the way lemmas are filtered. You can also perform exact searches and morphological searches on one field, without indexing the contents twice. See below for more details on how it's done.
Open-source Hebrew information retrieval (HebMorph, part 3)
Indexing Hebrew texts for later retrieval is not a trivial task. Although several solutions exist, I have pointed out that they do not necessarily provide the best results. Either way, there is no freely available solution that allows indexing Hebrew even at the most basic level.
HebMorph was started with this in mind. It is a free, open-source effort for making Hebrew properly searchable by various IR software libraries, while maintaining decent recall, precision and relevance in retrievals. During the work on this project, we will try and come up with different approaches to indexing Hebrew, and provide the tools to perform reliable comparisons between them. This project's ultimate goal is providing various IR libraries with the best Hebrew IR capabilities possible.
