Code972 Coding from the back of a camel

12 Nov 2012

Writing a software library and your day-to-day coding

When writing software, there is always a tension between writing good software fast and having good code structure, which usually results in a nice API. These two are completely orthogonal - you can definitely end up with one and not the other, and it is really hard and time-consuming to get both done properly.

Ideally, you go on a coding spree and write code wherever your fingers lead you: start with the stupidest code possible, even spaghetti code, and slowly refactor it into classes, interfaces and so on as the need arises (if you're still not using a functional language, that is!). This is completely fine, and how it should be. The goal should always be to get things up and running fast, and optimize later.

Some people would call this approach short iterations, and say it's Agile. Me? I'm just up for what works. Work this way, and you'll find yourself spending more time on the places that actually matter, not wasting time on the unimportant bits, and delivering better applications faster.

When I say don't pay much attention to the API unless a real need arises, what do I mean? Take method overloading as an example. When you first write a method, never provide several overloads just to simplify usage, however complex the method and however non-trivial its parameters. Have one method signature, and create the required constructs in the calling code. If you start to see a usage pattern that calls for an overload, go ahead and create it, if you don't have anything better to do.

This holds very true as long as we are talking about internal projects written and maintained by you and your team. It becomes almost completely false when we talk about creating a library or an API to be consumed by the public.

Take this Lucene code for example:

public static FSDirectory open(File path)

This method expects a File object. If you wanted to give it a string path, as you usually would when developing rapidly, you'd find yourself spending a few moments looking up the API for converting the string path you have into a File object. Admit it: nobody ever remembers those small bits unless this is what they do in their daily job.

And so it happens that this was exactly the case with Lucene.NET, being a strict port from Java. If you wanted to open a new FSDirectory, you had to create a new DirectoryInfo object out of your string path, stopping for a moment to think whether that should be done with a constructor that accepts a string or perhaps via a static factory method.

And that is, ladies and gentlemen, bad library design. (And yes, I know I picked the smallest, stupidest bit possible).

Because if a common usage pattern is to open an FSDirectory using a string FS path, you should allow for that. If it was your own code, this may not have been a justified argument for spending time on API refactoring (although it might have been!), but once we talk about a public-facing API, it becomes too important to dismiss. If only for all those moments wasted by thousands of developers wondering how to translate a string path into a File / DirectoryInfo object, piling up into wasted days or weeks across the software industry.

This was one of the first things I fixed in Lucene.NET after becoming a committer in the project, and starting with Lucene.NET 3.0.3 FSDirectory.Open has a public overload which accepts a string path.
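To make the idea concrete, here is a minimal sketch of such a convenience overload. SketchDirectory is a hypothetical stand-in written for this illustration, not the actual Lucene.NET source; the only thing taken from the post is the shape of the fix, a string overload that delegates to the DirectoryInfo one.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for a library type; NOT the real Lucene.NET FSDirectory.
public sealed class SketchDirectory
{
    public DirectoryInfo Location { get; }

    private SketchDirectory(DirectoryInfo location)
    {
        Location = location;
    }

    // The "strict port" signature: callers must build a DirectoryInfo themselves.
    public static SketchDirectory Open(DirectoryInfo path)
    {
        if (path == null) throw new ArgumentNullException(nameof(path));
        return new SketchDirectory(path);
    }

    // The convenience overload: accept the string most callers already have
    // and do the conversion once, inside the library.
    public static SketchDirectory Open(string path)
    {
        if (path == null) throw new ArgumentNullException(nameof(path));
        return Open(new DirectoryInfo(path));
    }
}
```

With this in place a caller can write SketchDirectory.Open("index") and never has to look up how a DirectoryInfo is constructed.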

6 Jul 2012

Hacking with RavenDB’s multi-maps

A couple of months ago I blogged about Orev - OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox that make it possible to measure and compare the relevance of different full-text search methods.

In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.

Orev was built using RavenDB as its backing store, and in this post I'm going to show a nice approach we used to facilitate the judging process.

The Model

To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.

A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for many reasons: they are different transactional units (if I update a typo in a document, the entire Corpus doesn't really change), and we want to retrieve one single document at a time when judging. Also, containing all documents within a parent Corpus document will bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.

And this is how they look:

	public class Corpus
	{
		public string Id { get; set; }

		[Required]
		public string Name { get; set; }

		[Required]
		public string Description { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example
	}

	public class CorpusDocument
	{
		public string Id { get; set; }

		public string CorpusId { get; set; }
		public string Title { get; set; }
		public string Content { get; set; }
		public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
	}

	public class Topic
	{
		public string Id { get; set; }

		[Required]
		public string Title { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Description { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Narrator { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example

		/// <summary>
		/// Id of user submitting this topic
		/// </summary>
		public string UserId { get; set; }
	}

	public class Judgment
	{
		public enum Verdict
		{
			Relevant,
			NotRelevant,
			Skip,
		};

		[Required]
		public string CorpusId { get; set; }

		[Required]
		public string DocumentId { get; set; }

		[Required]
		public string TopicId { get; set; }

		[Required]
		public string UserId { get; set; }

		[Required]
		public Verdict UserJudgement { get; set; }
	}

The Problem

We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.

Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.

The Solution

At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?

And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look on all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...

So this is where RavenDB's multi-maps come in. I select all Judgments with their Topic ID within an array, and all CorpusDocuments each with an array holding a single empty string. This will result in one big set of rows, with each row containing the CorpusDocument ID (which is a document ID + the corpus ID) and one Topic ID there exists a Judgment for. The reason I'm selecting Topics as an array at this stage is to match the format in which the Reduce step will produce its results; RavenDB requires all Map and Reduce functions to have the same type of output.

It is important to note ALL corpus documents will be listed, but there may be corpus documents with no topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.

The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows based on the CorpusDocument ID (which includes the Corpus ID), we can have a smaller set of rows, where a CorpusDocument is represented only once, and along with it all the Topic IDs there are Judgments for. So, if we previously had a lot of rows, each row with one CorpusDocument identifier and one Topic identifier, we now consolidated all the data we have for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.
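The map and reduce steps described so far can be tried out in memory with plain LINQ to Objects. The following is only a sketch of the data flow, not RavenDB code: the Row class, the MultiMapSketch name and the sample IDs are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MultiMapSketch
{
    // One row of map or reduce output (hypothetical shape, mirrors the description above).
    public sealed class Row
    {
        public string DocumentId;
        public string CorpusId;
        public string[] Topics;
    }

    // Group by (DocumentId, CorpusId) and merge the topic arrays,
    // so each CorpusDocument ends up in exactly one row.
    public static List<Row> Reduce(IEnumerable<Row> rows)
    {
        return rows
            .GroupBy(r => new { r.DocumentId, r.CorpusId })
            .Select(g => new Row
            {
                DocumentId = g.Key.DocumentId,
                CorpusId = g.Key.CorpusId,
                Topics = g.SelectMany(r => r.Topics).Distinct().ToArray()
            })
            .ToList();
    }

    public static List<Row> Demo()
    {
        var mapped = new List<Row>
        {
            // Rows a CorpusDocument map would emit: one per document, single empty topic.
            new Row { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { "" } },
            new Row { DocumentId = "docs/2", CorpusId = "corpus/1", Topics = new[] { "" } },
            // Rows a Judgment map would emit: one per judgment.
            new Row { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { "topics/1" } },
            new Row { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { "topics/2" } },
        };
        return Reduce(mapped);
    }
}
```

Demo() returns two rows: docs/1 carrying the empty string plus topics/1 and topics/2, and docs/2 carrying only the empty string because it was never judged.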

Hence, we write this index:

	public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
	{
		public class ReduceResult
		{
			public string DocumentId { get; set; }
			public string CorpusId { get; set; }
			public string[] Topics { get; set; }
		}

		public CorpusDocuments_ByNextUnrated()
		{
			AddMap<CorpusDocument>(docs => from corpusDoc in docs
										   select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] {string.Empty} }
										   );

			AddMap<Judgment>(judgments => from j in judgments
										  select new { DocumentId = j.DocumentId, j.CorpusId, Topics = new[] { j.TopicId } });

			Reduce = results => from result in results
								group result by new { result.DocumentId, result.CorpusId }
			                    into g
									select new
			                           	{
			                           		DocumentId = g.Key.DocumentId,
											CorpusId = g.Key.CorpusId,
			                           		Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
			                           	};

			TransformResults = (db, results) => from result in results
			                                    let doc = db.Load<CorpusDocument>(result.DocumentId)
			                                    select doc;
		}
	}

Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of topics with judgments for each CorpusDocument, where all CorpusDocuments exist, even if they were never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:

			var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
				.Where("Topics:*") // match all docs
				.AndAlso()
				.WhereEquals("CorpusId", corpusId)
				.AndAlso()
				.Not
				.WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
				.RandomOrdering()
				.FirstOrDefault();

This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1

This query will first match all index documents from the given corpus, and then will remove all CorpusDocuments which have a given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made to it; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.
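The selection logic of that query can be mimicked in memory to see which documents survive. This is a sketch under stated assumptions, not what RavenDB or Lucene execute: IndexEntry and NextUnratedSketch are hypothetical names, and the match-all clause has no counterpart here because an in-memory filter starts from all entries anyway (in Lucene it is needed since a purely negative query matches nothing).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NextUnratedSketch
{
    // Hypothetical in-memory shape of one reduced index entry.
    public sealed class IndexEntry
    {
        public string DocumentId;
        public string CorpusId;
        public string[] Topics;
    }

    // Same intent as: Topics:* AND CorpusId:<corpusId> AND -Topics:<topicId>
    public static List<string> Candidates(IEnumerable<IndexEntry> index, string corpusId, string topicId)
    {
        return index
            .Where(e => e.CorpusId == corpusId)        // restrict to the selected corpus
            .Where(e => !e.Topics.Contains(topicId))   // drop documents already judged for this topic
            .Select(e => e.DocumentId)
            .ToList();
    }

    public static List<string> Demo()
    {
        var index = new List<IndexEntry>
        {
            new IndexEntry { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { "", "topics/1" } },
            new IndexEntry { DocumentId = "docs/2", CorpusId = "corpus/1", Topics = new[] { "" } },
            new IndexEntry { DocumentId = "docs/3", CorpusId = "corpus/2", Topics = new[] { "" } },
        };
        return Candidates(index, "corpus/1", "topics/1");
    }
}
```

Demo() leaves only docs/2: docs/1 was already judged for topics/1, and docs/3 belongs to another corpus.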

And just to spice things up a bit, I threw in RandomOrdering().

22 May 2012

The future of geo-spatial searches with Lucene

Or: Introducing Spatial4n

The Lucene spatial contrib module has been a nice addition to Lucene, but for a while now too many bug reports have been piling up, and it got to a point where it was clear something was broken somewhere deep inside. Luckily, a bunch of good people started writing their own general purpose geo-spatial library in Java, and provided a Lucene module to interact with it to provide spatial search functionality. This project is called Spatial4j (formerly Lucene Spatial Playground), and it works great, solving all known issues with the previous implementation.

What's even better is that it was built from the ground up to support complex searches, like polygons and other custom shapes, as well as different search strategies. This is not just about a circle and a radius anymore. The guys that created it really dig geo-spatial searches, so it is probably going to get a lot better over time.

This library, as well as its accompanying Lucene module, is now part of Lucene, and should be available to all when Lucene 4.0 is released.

Since RavenDB uses the old spatial module, we were getting quite a few bug reports ourselves, without really being able to do anything about it. So when we heard about this project, it was clear that we should be using it. And since it is written in Java, well - luckily this isn't the first piece of Java code I've been porting...

The Spatial4n library - the .NET version of Spatial4j - is available here: https://github.com/synhershko/Spatial4n

The Lucene part of things, sync'd with Lucene.NET's trunk, can be found here: https://github.com/synhershko/lucene.net/tree/spatial2trunk . It will be there until those are merged upstream. There is also a branch with the new spatial module that is compatible with the 2.9.4 API - https://github.com/synhershko/lucene.net/tree/spatial.

We had to do some custom coding to get it to work with all the functionality we wanted, but it was all doable and so far this library looks very promising. All it needs now is a bit more attention.

21 Sep 2011

Orev: The Apache OpenRelevance Viewer

It has been quite some time since I said I'd be working on this, as I got caught up in other pressing matters and had to drop it for a while. But it is all for the best. The technology I used for this new version is just a perfect fit for this application, and it wasn't available then. I'll be addressing the technical aspects later in this post and also in some follow-up posts.

My first interest in the OpenRelevance project, and one of the main reasons I created Orev, was the HebMorph project. Using Orev, I'm hoping to be able to create an environment where tools for Hebrew IR can be tested and compared, to produce the ultimate Hebrew analyzer, for Lucene and other libraries as well.

Before anything else, the complete source code is available at https://github.com/synhershko/Orev.

I have a hosted version too which I will publish a link to soon, once I get some things sorted out and some feedback from other people who were involved in this project.

What is this?

The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.

These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.

Having no such tool, the Viewer - Orev - is meant to be exactly that, and so to minimize the overhead required from both the project managers and the people who will be doing the actual work. By providing nice and easy facilities to add new Topics and Corpora, and to feed documents into a corpus, it will make it very easy to manage the surrounding infrastructure. And with a nice web UI for judging documents, the work of the recruits is going to be very easy to grok.

More technical details

Orev is multi-lingual from the ground up, and is heavily user-based. Every user can view available topics and corpora, and make judgments based on the languages he speaks.

Managers can add new topics, create new corpora and feed those with documents. Documents can be added to a corpus, or updated, at a later time, too.

We will probably add the ability to enable users to send topics in as well and so on.

Even more technical details

When I started to work on this I was using NHibernate and spent some time on designing a DB schema, fighting with ASP.NET MVC and all that. Now that MVC 3 is out, and RavenDB is rocking worlds, it was a matter of a few hours to get this all started again from scratch. Using a schema-less DB really made this possible to do in a minimum number of hours, excluding some dilemmas and frustrations which I will be blogging about soon.

In the original design I intended to load corpus documents from external sources, or store them on the file-system. Since it is now using RavenDB, which is a document-based database, storing the documents in the DB itself actually makes sense. This is also how we can later offer updating a corpus with new documents, or patching old documents.

What's next

We need to run a lot of tests, get a lot of feedback and improve accordingly. The first step is obviously gathering content and raising interest, so if you find this post / project interesting - please spread the word.

Orev is currently using the default ASP.NET MVC theme. If there's any HTML5/CSS designer and magic worker who can take up the task to recreate it to be more inviting and easier to work with - it is something we can definitely use.

I have enabled the github bug tracker in the Orev source repository. Please use it for reporting bugs or asking for features.

When the dust settles and actual judging commences on a regular basis, we will start working on code to output stats and statistical computations, in preparation for the original cause of the OpenRelevance project - to measure performance of IR software (+ NLP + ML, of course), and to be able to produce bleeding edge analyzers for various languages.

30 Jun 2011

Practical Hebrew search – Open2011 presentation

Attached to this post is the presentation I gave today at Open2011 in Tel-Aviv.

The sample app can be found here: http://hebmorph.code972.com/. It is also going to be HebMorph's home in a few weeks, once I'm done generating all the necessary content.

As promised, I will be posting more details on some interesting findings on Hebrew search, and comparisons with Google search. I want those posts to be a bit more comprehensive, so they will be up in a few weeks' time.



26 Jun 2011

FastVectorHighlighter issues revisited

In a previous post I described how to use FVH to highlight contents which went through filters / readers like HTMLStripCharFilter in the analysis process. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew any CharFilter or Tokenizer implementation would store term positions and offsets that take into account any skips done in the content, but since it didn't work for me I didn't care to look any deeper, just made that workaround, and then ran to tell.

So, don't use that. Instead, rely on your analyzer to store positions and offsets and on FVH to use them correctly when highlighting. As it happens, the custom analyzers I used suffered from a nasty bug that was not allowing them to consider skips. Now that I fixed that, it all works like a charm.
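To illustrate what "offsets that consider skips" means, here is a small self-contained sketch. It is not how Lucene's CharFilter classes are implemented; OffsetSketch and its naive tag stripping are assumptions made purely to show the idea of mapping positions in the stripped text back to positions in the stored HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class OffsetSketch
{
    // Naively drops everything between '<' and '>' and records, for every kept
    // character, its index in the original string.
    public static string Strip(string html, out int[] originalOffsets)
    {
        var text = new StringBuilder(html.Length);
        var offsets = new List<int>(html.Length);
        bool insideTag = false;
        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<') { insideTag = true; continue; }
            if (c == '>' && insideTag) { insideTag = false; continue; }
            if (insideTag) continue;
            text.Append(c);
            offsets.Add(i);
        }
        originalOffsets = offsets.ToArray();
        return text.ToString();
    }

    // Maps a non-empty [start, end) range of the stripped text back onto the original string.
    public static string SliceOriginal(string html, int[] originalOffsets, int start, int end)
    {
        int from = originalOffsets[start];
        int to = originalOffsets[end - 1] + 1;
        return html.Substring(from, to - from);
    }
}
```

A token found in the stripped text can then be located in the stored HTML through the recorded offsets, which is the property the highlighter depends on.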

However, two issues still remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases the fragment that will end up on your webpage would ruin the page layout because of a stubborn misplaced </div> tag that found its way to the fragment. Escaping all <'s and >'s is not a really good solution - you don't really want your fragments to contain ugly looking HTML tags.

The second issue was having duplicate content. I wanted to process the content more than once - index it with 2 or more analyzers, but didn't want to store it more than once since it was exactly the same content.  To still be able to highlight on those other fields as well, I needed FVH to allow me to specify a field name to pull the stored contents from.

Solving the first problem was quite easy, and required nothing more than a simple extension function. It is called on the fragment string after receiving it from FVH. To be on the safe side, I made sure to ask for a larger fragment than I originally intended, so even if a lot of HTML noise is present, some context will remain in the fragment:

public static string HtmlStripFragment(this string fragment)
{
	if (string.IsNullOrEmpty(fragment)) return string.Empty;

	var sb = new StringBuilder(fragment.Length);
	bool withinHtml = false, first = true;
	foreach (var c in fragment)
	{
		if (c == '>')
		{
			if (first) sb.Length = 0;
			withinHtml = false;
			first = false;
			continue;
		}
		if (withinHtml)
			continue;
		if (c == '<')
		{
			first = false;
			withinHtml = true;
			continue;
		}
		sb.Append(c);
	}

	// FVH was instantiated with "[b]" and "[/b]" as pre- and post- tags for highlighting,
	// so they won't get lost in translation
	return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
}

The second issue was solved by subclassing FragmentsBuilder, only this time it was a bit less intrusive:

public class CustomFragmentsBuilder : BaseFragmentsBuilder
{
	public string ContentFieldName { get; protected set; }

	/// <summary>
	/// a constructor.
	/// </summary>
	public CustomFragmentsBuilder()
	{
	}

	public CustomFragmentsBuilder(string contentFieldName)
	{
		ContentFieldName = contentFieldName;
	}

	/// <summary>
	/// a constructor.
	/// </summary>
	/// <param name="preTags">array of pre-tags for markup terms</param>
	/// <param name="postTags">array of post-tags for markup terms</param>
	public CustomFragmentsBuilder(String[] preTags, String[] postTags)
		: base(preTags, postTags)
	{
	}

	public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
		: base(preTags, postTags)
	{
		ContentFieldName = contentFieldName;
	}

	/// <summary>
	/// do nothing. return the source list.
	/// </summary>
	public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
	{
		return src;
	}

	protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
	{
		var field = ContentFieldName ?? fieldName;
		var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
		return doc.GetFields(field); // according to Document class javadoc, this never returns null
	}
}

And as always the usual disclaimer applies - this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve that if such exist.

19 Jun 2011

Custom tokenization and Lucene’s FastVectorHighlighter

NOTE: The approach described below is wrong, you may want to read the follow-up post.

Perhaps you have tackled this before: you wanted to use Lucene's FastVectorHighlighter (aka FVH), but since you have a custom CharFilter in your analysis chain, the highlighter fails to produce valid fragments.

In my particular case, I used HTMLStripCharFilter (available to Lucene.Net through my pet contrib project) to extract text content from HTML pages, and then pass it through the rest of the analysis process. This confused FVH, since it was taking the full content from store, where HTML was still present, and token positions were not taking that into account. And any other custom CharFilter that is added to the analysis chain is going to cause the same troubles.

To overcome this, I needed to make sure FVH is aware of all content stripping operations that are made before or while tokenization is happening. All I had to do was to implement a custom FragmentsBuilder, looking as follows (.Net code; a Java version would look almost identical):

public class HtmlFragmentsBuilder : BaseFragmentsBuilder
{
	/// <summary>
	/// a constructor.
	/// </summary>
	public HtmlFragmentsBuilder()
		: base()
	{
	}

	/// <summary>
	/// a constructor.
	/// </summary>
	/// <param name="preTags">array of pre-tags for markup terms</param>
	/// <param name="postTags">array of post-tags for markup terms</param>
	public HtmlFragmentsBuilder(String[] preTags, String[] postTags)
		: base(preTags, postTags)
	{
	}

	/// <summary>
	/// do nothing. return the source list.
	/// </summary>
	public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
	{
		return src;
	}

	protected override String GetFragmentSource(StringBuilder buffer, int[] index, Field[] values, int startOffset, int endOffset)
	{
		string fieldText;
		while (buffer.Length < endOffset && index[0] < values.Length)
		{
			fieldText = GetFilteredFieldText(values[index[0]]);
			if (index[0] > 0 && values[index[0]].IsTokenized() && fieldText.Length > 0)
				buffer.Append(' ');
			buffer.Append(fieldText);
			++(index[0]);
		}
		var eo = buffer.Length < endOffset ? buffer.Length : endOffset;
		return buffer.ToString().Substring(startOffset, eo - startOffset);
	}

	/// <summary>
	/// Gets the field text, after applying custom filtering
	/// </summary>
	/// <param name="field"></param>
	/// <returns></returns>
	protected string GetFilteredFieldText(Field field)
	{
		var theStream = new MemoryStream(Encoding.UTF8.GetBytes(field.StringValue()));
		var reader = CharReader.Get(new StreamReader(theStream));
		reader = new HTMLStripCharFilter(reader);

		int r;
		var sb = new StringBuilder();
		while ((r = reader.Read()) != -1)
		{
			sb.Append((char)r);
		}
		return sb.ToString();
	}
}

FVH will then need to be configured to use it:

var fvh = new FastVectorHighlighter(FastVectorHighlighter.DEFAULT_PHRASE_HIGHLIGHT,
					FastVectorHighlighter.DEFAULT_FIELD_MATCH,
					new SimpleFragListBuilder(), new HtmlFragmentsBuilder());
// ...
var fq = fvh.GetFieldQuery(query);
var fragment = fvh.GetBestFragment(fq, searcher.GetIndexReader(), hits[i].doc, "Content", 300);

If you're using Lucene.Net, you'll have to make sure this patch is applied to your FVH before this could compile.

That was the easiest way to get this working, and fast. Perhaps I could make it more generic, or change the original implementation to allow that and submit it as a patch. Maybe I'll do it someday. Or you could...

16 Jun 2011

Announcing: Lucene.Net.Contrib

Whenever you start doing real-world stuff with Lucene you find yourself hacking and extending. That's the beauty of Lucene - it has so many extension points, and you can write almost every part of it from scratch to match your requirements.

Lately I've been working on some stuff relating to both RavenDB and HebMorph (separately...), and it became quite annoying keeping track of Lucene.Net extensions that are not part of the core project. In fact, several contrib packages (rather: projects) that are part of the original Lucene.Net project are hardly maintained and are not so friendly to use.

So, I thought it was time to give all those a home. I created a new github repository called Lucene.Net.Contrib, where all those enhancements, large or small, should go. Once there's enough to go on, I'll create a nuget package and make it easily accessible.

Having a centralized location for all those has only benefits. Bugs can be found and fixed, a lot of time can be saved by just checking whether someone has already ported or written the stuff you need, and most important of all: finding new opportunities. Java Lucene has had all that for quite some time now, and since I've been doing Lucene.Net a lot lately, I thought I'd make my small contribution...

This is not trying to compete with Lucene.Net's contrib section; it is just intended to be a much more flexible, fast-growing collection of extensions, most of which will probably be small in size.

What's currently there (not much - and only analysis/search related):

  • HTMLStripCharFilter - by plugging this into the analysis chain you can make any analyzer strip all HTML tags and take those positions into consideration (useful for later highlighting).
  • ReverseStringFilter - reverses a string; useful for cases where you need to allow leading wildcards and never trailing wildcards.
  • BinaryCoordSimilarity - a Lucene Similarity configuration which, in a multi-word query scenario, penalizes all results that do not contain ALL search terms.
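Why reversing helps with leading wildcards can be shown without Lucene at all. The sketch below is an illustration only, with hypothetical names and a plain list standing in for the index; it is not the code of the ReverseStringFilter in the repository. A pattern like "*ing" turns into the prefix "gni" once the terms are stored reversed, and prefix lookups are the cheap kind.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReversedTermsSketch
{
    // Simplification: reverses UTF-16 code units, so surrogate pairs are not handled.
    public static string Reverse(string term)
    {
        var chars = term.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Rewrites a leading-wildcard pattern such as "*ing" into the prefix
    // to look up among reversed terms ("gni").
    public static string ToReversedPrefix(string leadingWildcard)
    {
        if (string.IsNullOrEmpty(leadingWildcard) || leadingWildcard[0] != '*')
            throw new ArgumentException("expected a pattern starting with '*'", nameof(leadingWildcard));
        return Reverse(leadingWildcard.Substring(1));
    }

    // Finds the original terms matching the pattern by prefix-matching their reversed forms.
    public static List<string> Match(IEnumerable<string> terms, string leadingWildcard)
    {
        var prefix = ToReversedPrefix(leadingWildcard);
        return terms
            .Select(t => Reverse(t))                                     // what a reversing filter would index
            .Where(r => r.StartsWith(prefix, StringComparison.Ordinal))  // cheap prefix match
            .Select(r => Reverse(r))                                     // back to the readable form
            .ToList();
    }
}
```

For example, matching "*ing" against the terms running, runner and sing keeps running and sing.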

Other stuff that is probably going to be included (or makes sense to):

All code is released under the same Apache license as Lucene and Lucene.Net's, unless otherwise specified (but only permissive licenses are allowed in).

Have you put your Lucene.Net extensions in yet? Fork away!

GitHub repo: https://github.com/synhershko/Lucene.Net.Contrib

18 Jul 2010

Wikipedia offline reader with Hebrew search support

BzReader (http://code.google.com/p/bzreader/) is a simple utility which allows browsing dump files downloaded from Wikipedia. Once a dump is downloaded, BzReader will go through all pages and articles in the dump file and index their titles. Using BzReader, it is easy to browse and search Wikipedia for specific topics, and once a topic is found, to read it directly from the application. At the moment, the actual page contents aren't being indexed, only their titles.

I went ahead and forked the project, so I could add some extra functionalities more easily. For now I just updated the original code base to work with Lucene.Net 2.9.2 (the latest, instead of a very old version of it), and added better search support for Hebrew dumps with the help of HebMorph's Lucene.Net integration (see: code972.com/blog/hebmorph).

The updated code can be found here: http://github.com/synhershko/BzReader. Read the instructions there before compiling.

Here's a screenshot demonstrating how Hebrew searches were drastically improved after plugging HebMorph in. The search was for the Hebrew word "test" (noun). When used with StandardAnalyzer, only exact matches were found. When indexed and searched with HebMorph, also constructs and plurals of the word were found, for example "blood test" and "software tests":

Comparing Hebrew searches using StandardAnalyzer and HebMorph, via BzReader