The future of geo-spatial searches with Lucene
Or: Introducing Spatial4n
The Lucene spatial contrib module has been a nice addition to Lucene, but for a while now too many bug reports have been piling up, and it got to a point where it was clear something was broken somewhere deep inside. Luckily, a bunch of good people started writing their own general-purpose geo-spatial library in Java, along with a Lucene module that uses it to provide spatial search functionality. This project is called Spatial4j (formerly Lucene Spatial Playground), and it works great, solving all known issues with the previous implementation.
Even better, it was built from the ground up to support complex searches, like polygons and other custom shapes, as well as different search strategies. This is not just about a circle and a radius anymore. The guys that created it really dig geo-spatial searches, so it is probably going to get a lot better over time.
This library, as well as its accompanying Lucene module, is now part of Lucene, and should be available to all when Lucene 4.0 is released.
Since RavenDB uses the old spatial module, we were getting quite a few bug reports ourselves, without really being able to do anything about it. So when we heard about this project, it was clear that we should be using it. And since it is written in Java, well - luckily this isn't the first piece of Java code I've been porting...
The Spatial4n library - the .NET version of Spatial4j - is available here: https://github.com/synhershko/Spatial4n
The Lucene part of things, sync'd with Lucene.NET's trunk, can be found here: https://github.com/synhershko/lucene.net/tree/spatial2trunk . It will be there until those are merged upstream. There is also a branch with the new spatial module that is compatible with the 2.9.4 API - https://github.com/synhershko/lucene.net/tree/spatial.
We had to do some custom coding to get it to work with all the functionality we wanted, but it was all doable and so far this library looks very promising. All it needs now is a bit more attention.
FastVectorHighlighter issues revisited
In a previous post I described how to use FVH to highlight content that went through filters / readers like HTMLStripCharFilter in the analysis process. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew any CharFilter or Tokenizer implementation would store term positions and offsets that take into account any skips done in the content, but since it didn't work for me I didn't care to look any deeper, just made that workaround, and then ran to tell.
So, don't use that. Instead, rely on your analyzer to store positions and offsets and on FVH to use them correctly when highlighting. As it happens, the custom analyzers I used suffered from a nasty bug that prevented them from accounting for skips. Now that I fixed that, it all works like a charm.
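To make the idea of "offsets that take skips into account" a bit more concrete, here is a minimal sketch. It is not Lucene's or HTMLStripCharFilter's actual code; `OffsetCorrectionSketch`, `StripTags` and `CorrectedOffsets` are hypothetical names made up for illustration, and the tag stripping is deliberately naive. It only shows the principle: while stripping, remember where each kept character came from, so offsets found in the filtered text can be translated back to the stored text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not part of Lucene.Net.
public static class OffsetCorrectionSketch
{
    // Naively strips <...> tags and records, for every kept character,
    // its index in the original (stored) text.
    public static string StripTags(string original, List<int> originalIndex)
    {
        var filtered = new StringBuilder(original.Length);
        bool withinTag = false;
        for (int i = 0; i < original.Length; i++)
        {
            char c = original[i];
            if (c == '<') { withinTag = true; continue; }
            if (c == '>' && withinTag) { withinTag = false; continue; }
            if (withinTag) continue;
            filtered.Append(c);
            originalIndex.Add(i);
        }
        return filtered.ToString();
    }

    // Finds the first occurrence of term in the filtered text and returns
    // { start, end } (end exclusive) as offsets into the ORIGINAL text,
    // or null when the term is empty or not found.
    public static int[] CorrectedOffsets(string stored, string term)
    {
        var map = new List<int>();
        string filtered = StripTags(stored, map);
        int start = filtered.IndexOf(term, StringComparison.Ordinal);
        if (start < 0 || term.Length == 0) return null;
        int end = start + term.Length; // exclusive, in the filtered text
        return new[] { map[start], map[end - 1] + 1 };
    }
}
```

With the stored text `<p>Hello <b>Lucene</b> world</p>`, the term `Lucene` sits at offset 6 in the filtered text but the sketch reports 12 to 18, which is where it actually lives in the stored HTML. That is the kind of correction the highlighter depends on.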
However, two issues still remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases the fragment that will end up on your webpage would ruin the page layout because of a stubborn misplaced </div> tag that found its way to the fragment. Escaping all <'s and >'s is not a really good solution - you don't really want your fragments to contain ugly looking HTML tags.
The second issue was having duplicate content. I wanted to process the content more than once - index it with 2 or more analyzers, but didn't want to store it more than once since it was exactly the same content. To still be able to highlight on those other fields as well, I needed FVH to allow me to specify a field name to pull the stored contents from.
Solving the first problem was quite easy, and required nothing more than a simple extension function. It is called on the fragment string after receiving it from FVH. To be on the safe side, I made sure to ask for a larger fragment than I originally intended, so even if a lot of HTML noise is present, some context will remain in the fragment:
// Extension methods must be declared in a static class; StringBuilder needs "using System.Text;".
public static string HtmlStripFragment(this string fragment)
{
    if (string.IsNullOrEmpty(fragment)) return string.Empty;

    var sb = new StringBuilder(fragment.Length);
    bool withinHtml = false, first = true;
    foreach (var c in fragment)
    {
        if (c == '>')
        {
            // A '>' seen before any '<' closes a tag that was cut off at the start
            // of the fragment, so drop everything collected so far.
            if (first) sb.Length = 0;
            withinHtml = false;
            first = false;
            continue;
        }
        if (withinHtml)
            continue;
        if (c == '<')
        {
            first = false;
            withinHtml = true;
            continue;
        }
        sb.Append(c);
    }

    // FVH was instantiated with "[b]" and "[/b]" as pre- and post- tags for highlighting,
    // so they won't get lost in translation
    return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
}
The second issue was solved by subclassing FragmentsBuilder, only this time it was a bit less intrusive:
public class CustomFragmentsBuilder : BaseFragmentsBuilder
{
    // Name of the stored field to pull content from; null means "use the highlighted field".
    public string ContentFieldName { get; protected set; }

    /// <summary>
    /// a constructor.
    /// </summary>
    public CustomFragmentsBuilder()
    {
    }

    public CustomFragmentsBuilder(string contentFieldName)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// a constructor.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public CustomFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }

    public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// do nothing. return the source list.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }

    protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
    {
        // Load the stored content from ContentFieldName when one was given,
        // otherwise from the field being highlighted.
        var field = ContentFieldName ?? fieldName;
        var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
        return doc.GetFields(field); // according to Document class javadoc, this never returns null
    }
}
And as always the usual disclaimer applies - this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve that if such exist.
Custom tokenization and Lucene’s FastVectorHighlighter
NOTE: The approach described below is wrong, you may want to read the follow-up post.
Perhaps you have tackled this before: you wanted to use Lucene's FastVectorHighlighter (aka FVH), but since you have a custom CharFilter in your analysis chain, the highlighter fails to produce valid fragments.
In my particular case, I used HTMLStripCharFilter (available to Lucene.Net through my pet contrib project) to extract text content from HTML pages, and then passed it through the rest of the analysis process. This confused FVH, since it was taking the full content from the store, where HTML was still present, and token positions were not taking that into account. Any other custom CharFilter added to the analysis chain is going to cause the same trouble.
To overcome this, I needed to make sure FVH is aware of all content stripping operations that are made before or while tokenization is happening. All I had to do was to implement a custom FragmentsBuilder, looking as follows (.Net code; a Java version would look almost identical):
public class HtmlFragmentsBuilder : BaseFragmentsBuilder
{
    /// <summary>
    /// a constructor.
    /// </summary>
    public HtmlFragmentsBuilder()
        : base()
    {
    }

    /// <summary>
    /// a constructor.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public HtmlFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
    }

    /// <summary>
    /// do nothing. return the source list.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }

    protected override String GetFragmentSource(StringBuilder buffer, int[] index, Field[] values, int startOffset, int endOffset)
    {
        string fieldText;
        while (buffer.Length < endOffset && index[0] < values.Length)
        {
            fieldText = GetFilteredFieldText(values[index[0]]);
            if (index[0] > 0 && values[index[0]].IsTokenized() && fieldText.Length > 0)
                buffer.Append(' ');
            buffer.Append(fieldText);
            ++(index[0]);
        }
        var eo = buffer.Length < endOffset ? buffer.Length : endOffset;
        return buffer.ToString().Substring(startOffset, eo - startOffset);
    }

    /// <summary>
    /// Gets the field text, after applying custom filtering
    /// </summary>
    /// <param name="field"></param>
    /// <returns></returns>
    protected string GetFilteredFieldText(Field field)
    {
        var theStream = new MemoryStream(Encoding.UTF8.GetBytes(field.StringValue()));
        var reader = CharReader.Get(new StreamReader(theStream));
        reader = new HTMLStripCharFilter(reader);
        int r;
        var sb = new StringBuilder();
        while ((r = reader.Read()) != -1)
        {
            sb.Append((char)r);
        }
        return sb.ToString();
    }
}
FVH will then need to be configured to use it:
var fvh = new FastVectorHighlighter(
    FastVectorHighlighter.DEFAULT_PHRASE_HIGHLIGHT,
    FastVectorHighlighter.DEFAULT_FIELD_MATCH,
    new SimpleFragListBuilder(),
    new HtmlFragmentsBuilder());
// ...
var fq = fvh.GetFieldQuery(query);
var fragment = fvh.GetBestFragment(fq, searcher.GetIndexReader(), hits[i].doc, "Content", 300);
If you're using Lucene.Net, you'll have to make sure this patch is applied to your FVH before this could compile.
That was the easiest way to get this working, and fast. Perhaps I could make it more generic, or change the original implementation to allow that and submit it as a patch. Maybe I'll do it someday. Or you could...
Announcing: Lucene.Net.Contrib
Whenever you start doing real-world stuff with Lucene you find yourself hacking and extending. That's the beauty of Lucene - it has so many extension points, and you can write almost every part of it from scratch to match your requirements.
Lately I've been working on some stuff relating to both RavenDB and HebMorph (separately...), and it became quite annoying keeping track of Lucene.Net extensions that are not part of the core project. In fact, several contrib packages (rather: projects) that are part of the original Lucene.Net project are hardly maintained and are not so friendly to use.
So, I thought it was time to give all those a home. I created a new github repository called Lucene.Net.Contrib, where all those enhancements, large or small, should go. Once there's enough to go on, I'll create a nuget package and make it easily accessible.
Having a centralized location for all those has only benefits. Bugs can be found and fixed, a lot of time can be saved by just checking whether someone has already ported or written the stuff you need, and most important of all: finding new opportunities. Java Lucene has had all that for quite some time now, and since I've been doing Lucene.Net a lot lately, I thought I'd give my small donation...
This is not trying to compete with Lucene.Net's contrib section. It is just intended to be a much more flexible, fast-growing community of extensions, most of which will probably be small in size.
What's currently there (not much - and only analysis/search related):
- HTMLStripCharFilter - by plugging this into the analysis chain you can make any analyzer strip all HTML tags and take those positions into consideration (useful for later highlighting).
- ReverseStringFilter - reverses a string; useful for cases where you need to allow leading wildcards and never trailing wildcards.
- BinaryCoordSimilarity - a Lucene Similarity configuration which, in a multi-word query scenario, punishes all results that do not contain ALL search terms.
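For the curious, the ideas behind the last two items can be sketched in a few lines. This is not the code from the repository; `ContribIdeasSketch` and its members are hypothetical names, and the "index" is just a sorted in-memory list. The first part shows why reversing terms helps with leading wildcards: `*ing` becomes a plain prefix lookup for `gni` over the reversed terms. The second part is my assumption of what an all-or-nothing coord factor looks like, compared to the usual overlap / maxOverlap ratio.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; not the code from Lucene.Net.Contrib.
public static class ContribIdeasSketch
{
    // Simple char reversal; it does not try to keep surrogate pairs intact.
    public static string Reverse(string term)
    {
        char[] chars = term.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // "Index time": store every term reversed, sorted like a term dictionary.
    public static List<string> BuildReversedIndex(IEnumerable<string> terms)
    {
        var index = terms.Select(Reverse).ToList();
        index.Sort(StringComparer.Ordinal);
        return index;
    }

    // "Query time": the leading wildcard "*suffix" turns into a prefix match
    // on the reversed terms, which are reversed back before being returned.
    public static List<string> MatchLeadingWildcard(List<string> reversedIndex, string suffix)
    {
        string reversedPrefix = Reverse(suffix);
        return reversedIndex
            .Where(t => t.StartsWith(reversedPrefix, StringComparison.Ordinal))
            .Select(Reverse)
            .ToList();
    }

    // Assumed all-or-nothing coord factor: a document matching only some of
    // the query terms gets 0 instead of overlap / maxOverlap.
    public static float BinaryCoord(int overlap, int maxOverlap)
    {
        return overlap == maxOverlap ? 1f : 0f;
    }
}
```

Feeding it the terms `indexing`, `searching` and `lucene` and asking for the suffix `ing` returns the first two. A real index would seek straight to the prefix in its sorted term dictionary instead of scanning the whole list, which is the whole point of the trick.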
Other stuff that is probably going to be included (or makes sense to):
- sciolist's Hyphenation package at https://github.com/sciolist/Lucene.Net.Analysis.Hyphenation.
- Support for Faceting, ported from Lucene/Solr (or more simplistic implementations of it).
- Windows specific performance tools (better performing readers, streams, and the like).
- Whatever else anyone will think is worth having...
All code is released under the same Apache license as Lucene and Lucene.Net's, unless otherwise specified (but only permissive licenses are allowed in).
Have you put your Lucene.Net extensions in yet? Fork away!
GitHub repo: https://github.com/synhershko/Lucene.Net.Contrib