FastVectorHighlighter issues revisited

June 26th, 2011 English posts, Lucene, Lucene.Net, Lucene.Net.Contrib

In a previous post I described how to use FastVectorHighlighter (FVH) to highlight content that passes through filters or readers such as HTMLStripCharFilter during analysis. As DIGY spotted right away in the comments, my approach was all wrong. Yes, I knew that any CharFilter or Tokenizer implementation would store term positions and offsets that account for any skips made in the content, but since it didn't work for me I didn't bother to look any deeper, made that workaround, and ran to tell.

So, don't use that. Instead, rely on your analyzer to store positions and offsets, and on FVH to use them correctly when highlighting. As it happens, the custom analyzers I used suffered from a nasty bug that prevented them from accounting for skips. Now that I have fixed that, it all works like a charm.

However, two issues still remained. First, since my stored fields contain HTML, the fragments may contain HTML tags as well, sometimes partial ones. In many cases the fragment that ends up on your webpage will ruin the page layout because of a stubborn misplaced </div> tag that found its way into it. Escaping all <'s and >'s is not really a good solution - you don't want your fragments to contain ugly-looking HTML tags.

The second issue was having duplicate content. I wanted to process the content more than once - index it with 2 or more analyzers, but didn't want to store it more than once since it was exactly the same content. To still be able to highlight on those other fields as well, I needed FVH to allow me to specify a field name to pull the stored contents from.
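For context, the indexing side of such a setup might look roughly like the sketch below. It is not the code from my project: the field names `content` and `content_alt` and the helper class are made up for illustration, and the two analyzers are whatever you pass in. The only hard requirement is that every field you want FVH to highlight on is indexed with term vectors that include positions and offsets; only one of the fields actually stores the text.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;

public static class IndexingSketch
{
    // Hypothetical field names, for illustration only.
    public const string StoredField = "content";
    public const string ExtraField = "content_alt";

    // Index the same text under two fields, but store it only once.
    public static Document BuildDocument(string html)
    {
        var doc = new Document();

        // Stored once; term vectors with positions and offsets are what FVH reads.
        doc.Add(new Field(StoredField, html, Field.Store.YES, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));

        // Same text, analyzed differently, not stored.
        doc.Add(new Field(ExtraField, html, Field.Store.NO, Field.Index.ANALYZED,
                          Field.TermVector.WITH_POSITIONS_OFFSETS));
        return doc;
    }

    // Route each field to its own analyzer at indexing (and query) time.
    public static Analyzer BuildAnalyzer(Analyzer primary, Analyzer secondary)
    {
        var wrapper = new PerFieldAnalyzerWrapper(primary);
        wrapper.AddAnalyzer(ExtraField, secondary);
        return wrapper;
    }
}
```

With this layout, highlighting on `content_alt` has term vectors to work with but no stored text to cut fragments from, which is exactly the gap described above.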

Solving the first problem was quite easy, and required nothing more than a simple extension method. It is called on the fragment string after receiving it from FVH. To be on the safe side, I made sure to ask for a larger fragment than I originally intended, so that even if a lot of HTML noise is present, some context will remain in the fragment:

[code lang="csharp"]
// Requires: using System.Text;
// As an extension method, this has to live inside a static class.
public static string HtmlStripFragment(this string fragment)
{
    if (string.IsNullOrEmpty(fragment))
        return string.Empty;

    var sb = new StringBuilder(fragment.Length);
    bool withinHtml = false, first = true;
    foreach (var c in fragment)
    {
        if (c == '>')
        {
            // A '>' seen before any '<' means the fragment started in the middle
            // of a tag, so everything collected so far is tag debris
            if (first) sb.Length = 0;
            withinHtml = false;
            first = false;
            continue;
        }
        if (withinHtml) continue; // skip everything inside a tag
        if (c == '<')
        {
            first = false;
            withinHtml = true;
            continue;
        }
        sb.Append(c);
    }

    // FVH was instantiated with "[b]" and "[/b]" as pre- and post-tags for highlighting,
    // so they won't get lost in translation
    return sb.Append("...").Replace("[b]", "<b>").Replace("[/b]", "</b>").ToString();
}
[/code]

The second issue was solved by subclassing BaseFragmentsBuilder, only this time it was a bit less intrusive:

[code lang="csharp"]
// Namespaces below are those of the Lucene.Net contrib FastVectorHighlighter
// as I know it; adjust them if your version differs.
using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search.Vectorhighlight;
using WeightedFragInfo = Lucene.Net.Search.Vectorhighlight.FieldFragList.WeightedFragInfo;

public class CustomFragmentsBuilder : BaseFragmentsBuilder
{
    public string ContentFieldName { get; protected set; }

    /// <summary>
    /// a constructor.
    /// </summary>
    public CustomFragmentsBuilder() { }

    public CustomFragmentsBuilder(string contentFieldName)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// a constructor.
    /// </summary>
    /// <param name="preTags">array of pre-tags for markup terms</param>
    /// <param name="postTags">array of post-tags for markup terms</param>
    public CustomFragmentsBuilder(String[] preTags, String[] postTags)
        : base(preTags, postTags) { }

    public CustomFragmentsBuilder(string contentFieldName, String[] preTags, String[] postTags)
        : base(preTags, postTags)
    {
        ContentFieldName = contentFieldName;
    }

    /// <summary>
    /// do nothing. return the source list.
    /// </summary>
    public override List<WeightedFragInfo> GetWeightedFragInfoList(List<WeightedFragInfo> src)
    {
        return src;
    }

    // Pull the stored text from ContentFieldName when one was given,
    // otherwise fall back to the field being highlighted.
    protected override Field[] GetFields(IndexReader reader, int docId, string fieldName)
    {
        var field = ContentFieldName ?? fieldName;
        var doc = reader.Document(docId, new MapFieldSelector(new[] {field}));
        return doc.GetFields(field); // according to Document class javadoc, this never returns null
    }
}
[/code]
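Putting the two pieces together, the call site could look something like the sketch below. The helper class, its parameter names, and the fragment size of 300 are assumptions made for illustration; `CustomFragmentsBuilder` and `HtmlStripFragment` are the ones shown above, and the FVH calls are the ones I believe the Lucene.Net contrib port exposes, so check them against your version.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Vectorhighlight;

public static class HighlightSketch
{
    // searchedField: the field the query ran against (needs term vectors
    //                with positions and offsets).
    // storedField:   the field that actually stores the text.
    public static string Highlight(IndexReader reader, Query query, int docId,
                                   string searchedField, string storedField)
    {
        var fvh = new FastVectorHighlighter(
            true,  // phraseHighlight
            true,  // fieldMatch
            new SimpleFragListBuilder(),
            // "[b]" / "[/b]" survive the HTML stripping and are swapped for real tags afterwards
            new CustomFragmentsBuilder(storedField, new[] { "[b]" }, new[] { "[/b]" }));

        var fieldQuery = fvh.GetFieldQuery(query);

        // Ask for a generous fragment (300 chars is an arbitrary choice here),
        // since part of it will be discarded as HTML noise.
        var fragment = fvh.GetBestFragment(fieldQuery, reader, docId, searchedField, 300);

        // HtmlStripFragment returns an empty string when there is no fragment.
        return fragment.HtmlStripFragment();
    }
}
```

Passing the same name for `searchedField` and `storedField` gives the ordinary single-field behavior, so the helper can be used for both cases.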

And as always, the usual disclaimer applies - this isn't necessarily the best way to do this, and I'd definitely like to hear of more elegant ways to achieve it, if any exist.

Can you please explain where I have to add this code so I can use highlighting for Hebrew text?

I use the standard Highlighter highlighter = new Highlighter(formatter, scorer); but it is not good enough.

If I search for the word בית, it highlights בית but not בבית... (although it does find בבית)

Thanks for help!! :)

Yaniv July 9th, 2011

Code 972
