Code972 Coding from the back of a camel

9 May 2013

August consulting / training opportunities

This August I'm flying around a bit and have some availability for on-site training and consulting in various places around the world.

If you are interested in improving your team's skills on RavenDB, Lucene or ElasticSearch or need an extra set of eyes to help you solve a problem, I'll be happy to jump in for a day or two and help.

I'll be in the following cities / areas:

  • Barcelona, Spain
  • London, UK and surroundings
  • New York City and surroundings
  • Toronto, Canada

Contact me via e-mail (itamar at this domain) for more details.

3 May 2013

Hebrew search with ElasticSearch using HebMorph

Hebrew search is not an easy task, and HebMorph is a project I started several years ago to address that problem. After a certain period of inactivity I'm back actively working on it. I'm also happy to say there are already several live systems using it to enable Hebrew searches in their applications.

This post is a short step-by-step guide on how to use HebMorph in an ElasticSearch installation. There are quite a few configuration options and things to consider when enabling Hebrew search, most of them performance vs. relevance trade-offs, but I'll talk about those in a separate post.

0. What exactly is HebMorph

HebMorph is a project a bit wider than just providing a Hebrew search plugin for ElasticSearch, but for the purpose of this post let us treat it in that narrow aspect.

HebMorph has 3 main parts - the hspell dictionary files, the hebmorph-core package which is a wrapper around the dictionary files with important bits that allow for locating words even if they weren't written exactly as they appear in the dictionary, and the hebmorph-lucene package which contains various tools for processing streams of text into Lucene tokens - the searchable parts.

To enable Hebrew search from ElasticSearch we are going to need to use the Hebrew analyzer class HebMorph provides to analyze incoming Hebrew texts. That is done by providing ElasticSearch with the HebMorph packages and then telling it to use the Hebrew analyzer on text fields as needed.

1. Get HebMorph and hspell

At the moment you will have to compile HebMorph from sources yourself using Maven. In the future we might upload it to a centralized repository, but since we are still actively working on a lot of stuff there, it is a bit too early for that.

Probably the easiest way to get HebMorph is to git clone the main repository, located at https://github.com/synhershko/HebMorph. It already includes the latest hspell files under /hspell-data-files. If you are new to git, GitHub offers great tutorials for getting started with it, and also lets you download the entire source tree as a zip or a tarball.

Once you have the sources, run mvn package or mvn install to create 2 jars - hebmorph-core and hebmorph-lucene. Those 2 packages are required before moving on to the next step.

2. Create an ElasticSearch plugin

In this step we will create a new plugin which we will use in the next step to create the Hebrew analyzers in. If you already have a plugin you wish to use, skip to the next step.

ElasticSearch plugins are compiled Java packages that you simply drop into the plugins folder of your ElasticSearch installation; the ElasticSearch instance detects them automatically when it initializes. If you are new to this, you might want to read up a bit on that in the official ElasticSearch documentation. Here is a great guide to start with: http://jfarrell.github.io/

The gist of this is having a Java project with an es-plugin.properties file embedded as a resource and pointing to a class that tells ElasticSearch what classes to load as plugins, and their plugin type. In the next section we will use this to add our own Analyzer implementation which makes use of HebMorph's capabilities.
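As a rough illustration of that descriptor: in the ElasticSearch versions this post targets, es-plugin.properties holds a `plugin` key whose value is the fully qualified name of your plugin class. The sketch below only parses such content with java.util.Properties to show which key matters; it is not how ElasticSearch itself loads plugins, and `com.example.MyPlugin` is a placeholder name.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PluginDescriptorDemo {
    // Contents of a hypothetical es-plugin.properties; the class name is a placeholder.
    static final String DESCRIPTOR = "plugin=com.example.MyPlugin\n";

    // Returns the value of the "plugin" key, i.e. the plugin class name.
    static String pluginClassName(String descriptor) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(descriptor));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("plugin");
    }

    public static void main(String[] args) {
        System.out.println(pluginClassName(DESCRIPTOR));
    }
}
```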

3. Creating a Hebrew Analyzer

HebMorph already comes with MorphAnalyzer - an Analyzer implementation which takes care of Hebrew-aware tokenization, lemmatization and whatnot. Because it is highly configurable, I personally prefer re-implementing it in the ElasticSearch plugin so it is easier to change the configuration in code. In case you wondered, I'm not planning on supporting external configuration for this, as it is too subtle and you should really know what you are doing there.

Don't forget to add dependencies to hebmorph-core and hebmorph-lucene to your project.

My common Analyzer setup for Hebrew search looks like this:

public abstract class HebrewAnalyzer extends ReusableAnalyzerBase {

    protected enum AnalyzerType {
        INDEXING, QUERY, EXACT
    }

    private static final DictRadix<Integer> prefixesTree = LingInfo.buildPrefixTree(false);
    private static DictRadix<MorphData> dictRadix;
    private final StreamLemmatizer lemmatizer;
    private final LemmaFilterBase lemmaFilter;

    protected final Version matchVersion;
    protected final AnalyzerType analyzerType;
    protected final char originalTermSuffix = '$';

    static {
        try {
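            // resourcesPath is not defined in this snippet: point it at the folder that contains hspell-data-files
            dictRadix = Loader.loadDictionaryFromHSpellData(new File(resourcesPath + "hspell-data-files"), true);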
        } catch (IOException e) {
            // TODO log
        }
    }

    protected HebrewAnalyzer(final AnalyzerType analyzerType) throws IOException {
        this.matchVersion = Version.LUCENE_42; // assumed: matches the Lucene 4.2.1 release this post targets
        this.analyzerType = analyzerType;
        lemmatizer = new StreamLemmatizer(null, dictRadix, prefixesTree, null);
        lemmaFilter = new BasicLemmaFilter();
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName, final Reader reader) {
        // on query - if marked as keyword don't keep origin, else only lemmatized (don't suffix)
        // if word terminates with $ will output word$, else will output all lemmas or word$ if OOV
        if (analyzerType == AnalyzerType.QUERY) {
            final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter);
            src.setAlwaysSaveMarkedOriginal(true);
            src.setSuffixForExactMatch(originalTermSuffix);

            TokenStream tok = new SuffixKeywordFilter(src, '$');
            return new TokenStreamComponents(src, tok);
        }

        if (analyzerType == AnalyzerType.EXACT) {
            // on exact - we don't care about suffixes at all, we always output original word with suffix only
            final HebrewTokenizer src = new HebrewTokenizer(reader, prefixesTree, null);
            TokenStream tok = new NiqqudFilter(src);
            tok = new LowerCaseFilter(matchVersion, tok);
            tok = new AlwaysAddSuffixFilter(tok, '$', false);
            return new TokenStreamComponents(src, tok);
        }

        // on indexing we should always keep both the stem and marked original word
        // will ignore $ && will always output all lemmas + origin word$
        // basically, if (analyzerType == AnalyzerType.INDEXING)
        final StreamLemmasFilter src = new StreamLemmasFilter(reader, lemmatizer, null, lemmaFilter);
        src.setAlwaysSaveMarkedOriginal(true);

        TokenStream tok = new SuffixKeywordFilter(src, '$');
        return new TokenStreamComponents(src, tok);
    }

    public static class HebrewIndexingAnalyzer extends HebrewAnalyzer {
        public HebrewIndexingAnalyzer() throws IOException {
            super(AnalyzerType.INDEXING);
        }
    }

    public static class HebrewQueryAnalyzer extends HebrewAnalyzer {
        public HebrewQueryAnalyzer() throws IOException {
            super(AnalyzerType.QUERY);
        }
    }

    public static class HebrewExactAnalyzer extends HebrewAnalyzer {
        public HebrewExactAnalyzer() throws IOException {
            super(AnalyzerType.EXACT);
        }
    }
}

You may notice that I created 3 separate analyzers: one for indexing, one for querying and one for exact querying. I'll be talking more about this in future posts, but the idea is to provide flexibility when querying while still allowing for correct indexing.
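To make the difference between the three modes concrete, here is a toy model of what each one emits for a single word, following the comments in the analyzer code above. It is only a sketch: the dictionary is a hard-coded map with a transliterated placeholder entry, and none of this is HebMorph's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AnalyzerModesSketch {
    enum Mode { INDEXING, QUERY, EXACT }

    static final char SUFFIX = '$';

    // Toy stand-in for the hspell dictionary: surface form -> lemmas (placeholder data).
    static final Map<String, List<String>> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("habayit", Arrays.asList("bayit"));
    }

    static List<String> tokens(String word, Mode mode) {
        boolean markedExact = word.endsWith(String.valueOf(SUFFIX));
        String bare = markedExact ? word.substring(0, word.length() - 1) : word;
        List<String> lemmas = LEMMAS.get(bare); // null when the word is out of vocabulary
        List<String> out = new ArrayList<>();
        switch (mode) {
            case EXACT:
                // Always the original word with the suffix, nothing else.
                out.add(bare + SUFFIX);
                break;
            case QUERY:
                // word$ if marked exact or unknown, otherwise the lemmas only.
                if (markedExact || lemmas == null) {
                    out.add(bare + SUFFIX);
                } else {
                    out.addAll(lemmas);
                }
                break;
            case INDEXING:
                // Ignores any $ marker: all lemmas plus the marked original word.
                if (lemmas != null) {
                    out.addAll(lemmas);
                }
                out.add(bare + SUFFIX);
                break;
        }
        return out;
    }
}
```

Because the indexing mode stores both the lemmas and the marked original, a lemma-expanded query and an exact query can both be answered from the same index.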

Configuring the analyzers to be picked up by ElasticSearch is rather easy now. First, you need to wrap each analyzer in a "provider", like so:

public class HebrewQueryAnalyzerProvider extends AbstractIndexAnalyzerProvider<HebrewAnalyzer.HebrewQueryAnalyzer> {
    private final HebrewAnalyzer.HebrewQueryAnalyzer hebrewAnalyzer;

    @Inject
    public HebrewQueryAnalyzerProvider(Index index, @IndexSettings Settings indexSettings, Environment env, @Assisted String name, @Assisted Settings settings) throws IOException {
        super(index, indexSettings, name, settings);
        hebrewAnalyzer = new HebrewAnalyzer.HebrewQueryAnalyzer();
    }

    @Override
    public HebrewAnalyzer.HebrewQueryAnalyzer get() {
        return hebrewAnalyzer;
    }
}

After you've created such providers for all types of analyzers, create an AnalysisBinderProcessor like this (or update your existing one with definitions for the Hebrew analyzers):

public class MyAnalysisBinderProcessor extends AnalysisModule.AnalysisBinderProcessor {

    private final static HashMap<String, Class<? extends AnalyzerProvider>> languageAnalyzers = new HashMap<>();
    static {
        languageAnalyzers.put("hebrew", HebrewIndexingAnalyzerProvider.class);
        languageAnalyzers.put("hebrew_query", HebrewQueryAnalyzerProvider.class);
        languageAnalyzers.put("hebrew_exact", HebrewExactAnalyzerProvider.class);
    }

    public static boolean analyzerExists(final String analyzerName) {
        return languageAnalyzers.containsKey(analyzerName);
    }

    @Override
    public void processAnalyzers(final AnalyzersBindings analyzersBindings) {
        for (Map.Entry<String, Class<? extends AnalyzerProvider>> entry : languageAnalyzers.entrySet()) {
            analyzersBindings.processAnalyzer(entry.getKey(), entry.getValue());
        }
    }
}

Don't forget to update your Plugin class to catch the AnalysisBinderProcessor - it should look something like this (plus any other stuff you want to add there):

public class MyPlugin extends AbstractPlugin {
    @Override
    public String name() {
        return "my-plugin";
    }

    @Override
    public String description() {
        return "Implements custom actions required by me";
    }

    @Override
    public void processModule(Module module) {
        if (module instanceof AnalysisModule) {
            ((AnalysisModule)module).addProcessor(new MyAnalysisBinderProcessor());
        }
    }

}

4. Using the Hebrew analyzers

Compile the ElasticSearch plugin and drop it along with its dependencies in a folder under the /plugins folder of ElasticSearch. You now have 3 new types of analyzers at your disposal: "hebrew", "hebrew_query" and "hebrew_exact".

For indexing, you want to use the "hebrew" analyzer. In your mapping, you can define a certain field or an entire set of fields to use that specific analyzer by setting the analyzer for that field. You can also leave the analyzer configuration blank, and use the _analyzer field in the index request to specify the analyzer for fields that have none configured. The ElasticSearch mapping documentation covers both options.

The "hebrew" analyzer will expand each term to all recognized lemmas; in case the word wasn't recognized it will try to tolerate spelling errors or missing Yud/Vav - most of the time it will be successful (with some rate of false positives, which the lemma-filters should remove to some degree). Some words will still remain unrecognized and thus will be indexed as-is.

When querying using a QueryString query you can specify what analyzer to use - use the "hebrew_query" or "hebrew_exact" analyzer. The former will perform lemma expansion similar to the indexing analyzer, and the latter will avoid that and allow you to perform exact matches (useful when searching for names or exact phrases).
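As a sketch of what the two request bodies might look like, the helper below builds a mapping that sets the analyzer on one field, and a query_string query that names the analyzer to use. The bodies are assembled as plain strings so that no particular client API is assumed; the type name "article" and field name "body" used in the test are made up, and the exact mapping layout can differ between ElasticSearch versions.

```java
final class HebrewSearchRequests {
    // Mapping body for one type: the given string field is indexed with the "hebrew" analyzer.
    static String mapping(String type, String field) {
        return "{\"" + type + "\":{\"properties\":{\"" + field
                + "\":{\"type\":\"string\",\"analyzer\":\"hebrew\"}}}}";
    }

    // query_string body; picks "hebrew_exact" for exact matching, "hebrew_query" otherwise.
    // Note: the text is not JSON-escaped here, so keep quotes and backslashes out of it.
    static String query(String field, String text, boolean exact) {
        String analyzer = exact ? "hebrew_exact" : "hebrew_query";
        return "{\"query\":{\"query_string\":{\"default_field\":\"" + field
                + "\",\"query\":\"" + text
                + "\",\"analyzer\":\"" + analyzer + "\"}}}";
    }
}
```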

I pretty much ignored a lot of the complexity involved in fine tuning searches for Hebrew, and many very cool things HebMorph allows you to do with Hebrew search for the sake of focus. I will revisit them in a later blog post.

5. Administration

The hspell dictionary files are looked up at a physical location on disk, so you will need to provide the path they are saved at. Since dictionaries get updated, it is sometimes easier to update them that way in a distributed environment like the one I'm working with. It may be desirable to have them compiled within the same jar file as the code itself - I'll be happy to accept a pull request to do that.

The code above works with ElasticSearch 0.90 GA and Lucene 4.2.1. I also had it running on earlier versions of both, though that may have required a few minor changes. I assume the samples will break on future versions and I probably won't have much time to go back and keep them up to date, but bear in mind that most of the time the changes are minor and easy to understand and make by yourself.

Both HebMorph and the hspell dictionary are released under the AGPL3. For any questions on licensing, feel free to contact me.

15 Mar 2013

The story of a massive system refactoring for the cloud: prologue

A couple of months ago I started working at Buzzilla, a company developing "cutting edge technologies and revolutionary analysis and research methodologies that combine to create advanced solutions aimed at harnessing the vast opportunities presented by online conversation". Or in short, full-text search and analytics on BigData.

When I started there the existing system was based on a home-brewed solution for distributing a Lucene index across dozens of nodes. Since we are looking to expand way beyond the number of a couple of dozen servers, we really needed to recreate the system using tools better suited for the job, which we could also take to the cloud. And it is much more than just about the search engine.

Refactoring an operational distributed system is really a great challenge. While keeping it operational, you need to replace parts one by one, but you also get the opportunity to experiment with new tools and bleeding edge technologies. It is also a great subject for a series of blog posts, this one being the first of many to come. Among the items we had to tackle on which I'll blog are:

  • Building a distributed and highly available search engine
  • Various topics on full-text search: relevance, scoring, multi-lingual search, best practices for analysis and more
  • Choosing the right web framework for a web UI
  • Migrating from MySQL to NoSQL - and selecting the right NoSQL for the job
  • Caching
  • Keeping tabs of logs in a distributed environment
  • Planning for and recovering from failures and crashes
  • CI, deployment, versioning and backups
  • Identifying and fixing performance bottlenecks
  • Generating and displaying system stats and performance metrics

Such a refactoring process could easily turn out to be a disaster. This is the story of careful planning and a great team working together, which made this a success. We are now in the final stages of migrating to the new system, and we have many more challenges pending after we have gone completely live. This series is going to span months and hopefully have some great content which will spark great discussions.

14 Jan 2013

NancyFx: Live editable views with RavenDB

When building a website with NancyFX, views are loaded from the file system by default, pretty much like with all MVC-based websites. While NancyFX also supports loading views embedded in assemblies as resources, both options require re-deploying actual files when something in the view needs updating. Even in fully CI environments, that is still sort of a PITA.

Here is how to use RavenDB to override views in a live website without re-deploying anything. Basically what this does, thanks to NancyFX's modular and flexible design, is take the default ViewLocationProvider and encapsulate it, reading all the views from the original location (file system or assembly resources), and give precedence to views loaded from RavenDB.

The code featured here makes 2 assumptions:

1. A document name convention for view documents is preserved - basically some prefix (for example "WebsiteViews/") used to prevent polluting the document store, followed by the full view name (location + name + extension). When loading all available views, we use a filter to make sure we load only views with a supported extension (determined by the installed view engines). A view-template document in RavenDB will then have an ID similar to "WebsiteViews/Views/Home/Read.cshtml".

2. There are fewer than 1024 views stored in RavenDB. This is probably safe to assume, or you have some monstrous website.

There's one bit missing here - view-cache invalidation. By default NancyFx will cache all views it loaded indefinitely, as far as I can tell. That's bad for us, because when you update a template in your RavenDB store you do want to invalidate all caches, or at least one specific template you updated. This is something I'll keep for another post.

This is the custom ViewLocationProvider class:

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using NSemble.Core.Models;
using Nancy;
using Nancy.ViewEngines;

namespace NSemble.Core.Nancy
{
	public class RavenViewLocationProvider : IViewLocationProvider
	{
		private readonly IViewLocationProvider defaultViewLocationProvider;

		public RavenViewLocationProvider(IRootPathProvider rootPathProvider)
		{
			defaultViewLocationProvider = new FileSystemViewLocationProvider(rootPathProvider);
		}

		public RavenViewLocationProvider(IRootPathProvider rootPathProvider, IFileSystemReader fileSystemReader)
		{
			defaultViewLocationProvider = new FileSystemViewLocationProvider(rootPathProvider, fileSystemReader);
		}

		public IEnumerable<ViewLocationResult> GetLocatedViews(IEnumerable<string> supportedViewExtensions)
		{
			var sb = new StringBuilder();

			// Make sure to only load saved views with supported extensions
			foreach (var s in supportedViewExtensions)
			{
				if (sb.Length > 0)
					sb.Append("|");
				sb.Append("*.");
				sb.Append(s);
			}

			ViewTemplate[] views = null;
			using (var session = NSembleModule.DocumentStore.OpenSession())
			{
				// It's probably safe to assume we will have no more than 1024 views, so no reason to bother with paging
				views = session.Advanced.LoadStartingWith<ViewTemplate>(Constants.RavenViewDocumentPrefix, sb.ToString(), 0, 1024);
			}

			// Read the views from the default location
			IEnumerable<ViewLocationResult> defaultViews = defaultViewLocationProvider.GetLocatedViews(supportedViewExtensions);
			if (views.Length == 0)
				return defaultViews;

			var ret = new HashSet<ViewLocationResult>(
				from v in views
				where supportedViewExtensions.Contains(v.Extension)
				select new ViewLocationResult(
					v.Location,
					v.Name,
					v.Extension,
					() => new StringReader(v.Contents)));

			foreach (var v in defaultViews)
				ret.Add(v);

			return ret;
		}
	}
}
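
The two ideas inside GetLocatedViews, building the extension filter and letting stored views win over the default ones, can be sketched in isolation. The version below is in Java and keys views by their full name (location + name + extension); that keying is an assumption made for the sketch, since the real provider relies on how ViewLocationResult compares instances.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ViewOverrideSketch {
    // Builds a "*.cshtml|*.html" style pattern from the supported extensions.
    static String extensionPattern(List<String> extensions) {
        StringBuilder sb = new StringBuilder();
        for (String ext : extensions) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("*.").append(ext);
        }
        return sb.toString();
    }

    // Keys are full view names, values are template contents.
    // Stored views go in first, so a default view is only added when no stored view has the same key.
    static Map<String, String> merge(Map<String, String> stored, Map<String, String> defaults) {
        Map<String, String> merged = new LinkedHashMap<>(stored);
        for (Map.Entry<String, String> e : defaults.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue());
        }
        return merged;
    }
}
```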

You will need to register RavenViewLocationProvider in your Nancy Bootstrapper class:

        protected override NancyInternalConfiguration InternalConfiguration
        {
            get
            {
                return NancyInternalConfiguration
                    .WithOverrides(x => x.ViewLocationProvider = typeof (RavenViewLocationProvider))
                    .WithIgnoredAssembly(asm => asm.FullName.StartsWith("RavenDB", StringComparison.InvariantCulture)); // or override ConfigureApplicationContainer to set AutoRegister to false
            }
        }

12 Jan 2013

Geo-spatial search with RavenDB 2.0

I recently wrote an article on geo-spatial search with RavenDB for "The Developer", a Norwegian magazine for developers. Seeing how questions on this topic keep coming in, I thought it would be a good idea to post it here as a reference for anyone interested.

So, there you go. Happy searching.

28 Nov 2012

Started work on a “RavenDB in Action” book

I recently signed a contract with Manning for writing a "RavenDB in Action" book. Writing this book comes naturally after delivering many RavenDB workshops, talks, training sessions and consultancy gigs. The structure of the book and the actual content in it are based on actual experience of explaining RavenDB to people of different levels. Also many examples I'll use are going to be taken from real-world scenarios.

This book is aimed at .NET developers with actual programming experience, and I'm working hard to make sure that no RDBMS background is required and that it won't confuse SQL savants either. The book is designed to help RavenDB newbies go from zero to hero. Experienced Ravenees would definitely benefit from reading it as well, as it is going to cover many advanced topics in great detail.

We are already in the middle of writing the first 4 chapters, and hopefully they'll be available on MEAP in two or three months. I'll make sure to announce once it is released...

12 Nov 2012

Writing a software library and your day-to-day coding

When writing software, there is always this tension between writing good software fast and having good code structure, which usually results in a nice API. These two are completely orthogonal - you can definitely end up having one without the other, and it is really hard and time consuming to have both properly done.

Ultimately, one would go on a coding spree and write code as his fingers lead him. Starting with the stupidest code possible, even with Spaghetti Code, and slowly refactoring it to classes and interfaces and so on, as the need arises (if you're still not using a functional language, that is!). This is completely fine, and how it should be. The goal should always be to get things up and running fast, and optimize later.

Some people would call this approach short iterations, and say it's Agile. Me? I'm just up for what works. Work this way, and you'll find yourself spending more time on the places that actually matter, not wasting time on the unimportant bits, and delivering better applications faster.

When I say don't pay much attention to the API unless real need arises, what do I mean? Take method overloading as an example. Never provide several method overloads to simplify usage when you first write a method, however complex it is and however non-trivial its parameters are. Have one method signature and create the required constructs in your calling code. If you start seeing a usage pattern which calls for an overload, go ahead and create it if you don't have anything better to do.

This holds very true as long as we are talking about internal projects written and maintained by you and your team. It becomes almost completely false when we talk about creating a library or an API to be consumed by the public.

Take this Lucene code for example:

public static FSDirectory open(File path)

This method expects a File object. If you wanted to provide it with a string path, as you'd usually go when rapid-developing, you'd find yourself spending a few moments looking up the API for converting the string path you have to a File object. Admit it, nobody ever remembers those small bits unless this is what they do in their daily job.

And so it happens that this was exactly the case with Lucene.NET, being a strict port from Java. If you wanted to open a new FSDirectory, you would have to create a new DirectoryInfo object out of your string path, thinking for a moment whether that should be done using a constructor which accepts a string or perhaps via a static factory method.

And that is, ladies and gentlemen, bad library design. (And yes, I know I picked the smallest, stupidest bit possible).

Because if a common usage pattern is to open an FSDirectory using a string FS path, you should allow for that. If it was your own code, this may not have been a justified argument for spending time on API refactoring (although it might have been!), but once we talk public facing API - it becomes too important to dismiss. If only for all of those moments wasted by thousands of developers when wondering on how to translate a string path to a File / DirectoryInfo object, piling up to be wasted days or weeks in the software industry.

This was one of the first things I fixed in Lucene.NET after becoming a committer in the project, and starting with Lucene.NET 3.0.3 FSDirectory.Open has a public overload which accepts a string path.
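The kind of fix described here is tiny. As a sketch, with a made-up IndexDirectory class standing in for FSDirectory (this is not the Lucene or Lucene.NET source), a convenience overload simply does the conversion once so that callers don't have to:

```java
import java.io.File;

final class IndexDirectory {
    private final File path;

    private IndexDirectory(File path) {
        this.path = path;
    }

    // The original, strict entry point.
    static IndexDirectory open(File path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        return new IndexDirectory(path);
    }

    // Convenience overload for the common case of having a string path at hand.
    static IndexDirectory open(String path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        return open(new File(path));
    }

    File path() {
        return path;
    }
}
```

Both overloads end up in the same place, so the extra method adds no behavior of its own; it only removes a lookup from every caller's day.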

1 Oct 2012

RavenDB Consultancy & Training

I've been working as a core developer for RavenDB for quite a while, writing core features, providing support to users and customers, and co-authoring and delivering the official 2-day RavenDB Workshop. Starting October (today...) I'm offering on-site and remote RavenDB consultancy services, as an independent RavenDB consultant. I'm also available for on-site RavenDB training worldwide (1, 2 or 3 day courses).

I can be contacted either by email (itamar at this domain) or Skype (itamarsyn).

24 Aug 2012

Geo-spatial searches with RavenDB

For quite a while RavenDB had geo-spatial search capabilities, but ever since it was introduced it was limited to finding documents with latitude and longitude within a radius from a given point. In the past few weeks I was working on revamping the Lucene.Net spatial module, and earlier this week the work on that was complete. Next in line was getting those changes into RavenDB. I just finished doing that, and this post is going to show what it can do, and how.

First, a few words on geo-spatial indexes. To be able to represent a shape in an index, and then search for it, shapes are converted to an index-friendly representation. There are quite a few ways to do this, the most commonly known approaches being prefix trees and bounding-box. The QuadPrefixTree approach, for example, represents the earth with 4 grid squares at its first level of precision. The squares are labeled A, B, C and D. The next level of precision introduces another letter to the representation, so we get 16 grid squares - AA, AB, AC, AD, BA, ... and so on. By having these multiple layers of precision, we can create the most efficient representation of a shape which balances number of terms vs precision. Another implementation called GeohashPrefixTree uses geohashes which have more grid squares per layer.
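To get a feel for how such labels come about, here is a small sketch that computes the quad-tree cell of a single point by repeatedly halving the latitude and longitude ranges. Which quadrant gets which letter is an assumption made here (A = north-west, B = north-east, C = south-west, D = south-east); the real QuadPrefixTree may order them differently.

```java
final class QuadCellSketch {
    // Returns the cell label of a point, one letter per level of precision.
    static String cell(double lat, double lon, int levels) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder label = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            double midLat = (minLat + maxLat) / 2;
            double midLon = (minLon + maxLon) / 2;
            boolean north = lat >= midLat;
            boolean east = lon >= midLon;
            // Assumed labeling: A = NW, B = NE, C = SW, D = SE.
            label.append(north ? (east ? 'B' : 'A') : (east ? 'D' : 'C'));
            // Narrow the bounding box to the chosen quadrant.
            if (north) { minLat = midLat; } else { maxLat = midLat; }
            if (east) { minLon = midLon; } else { maxLon = midLon; }
        }
        return label.toString();
    }
}
```

Note how the label at a coarser level is always a prefix of the label at a finer level, which is what makes these cells work well as index terms.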

Before diving any deeper, here's how you would perform a simple point and radius spatial search. This is taken directly from the old API (which we revised a bit), and since it's easier to use for the most common usage of geo-spatial searches, we left it mostly intact:

	// The spatial index
	public class LegacySpatialIndex : AbstractIndexCreationTask<Event>
	{
		public LegacySpatialIndex()
		{
			Map = docs => from doc in docs
						  select new
						  {
							  doc.Title,
							  _ = SpatialGenerate(doc.lat, doc.lng)
						  };
		}
	}

	// The querying method
	public IEnumerable<Event> GetEventsLegacy()
	{
		IEnumerable<Event> events;
		using (var session = store.OpenSession())
		{
			events = session.Query<Event>()
				.Customize(x => x.WithinRadiusOf(10, 32.456236, 54.234053))
				.ToList();
		}
		return events;
	}

The new spatial stuff is quite powerful, and we really wanted to keep all that power in your hands. Therefore, when defining an index you get a chance to specify which spatial strategy and what prefix tree "height" to use. You can just use the defaults if you wish to, of course.

Shapes in both documents and queries are represented using WKT - a markup language for representing shapes, so they are as human readable as they can possibly be. Using WKT also frees everyone from hard-to-use APIs and tons of classes, at least as long as the shapes you use are simple enough. If you are expecting to handle complex shapes, it is recommended that you install NetTopologySuite from NuGet to help you with creating shapes and serializing them to their WKT string representation.

Here is an example of the new capabilities. Please note, I just pushed the code for that in, so the API might change a bit by the time you get to play with it:

	public class Event
	{
		public string Title { get; set; }

		// WKT representation of a point on earth, ex. POINT (24.532341 54.352753)
		public string Location { get; set; }
	}

	public class SpatialIndex : AbstractIndexCreationTask<Event>
	{
		public SpatialIndex()
		{
			Map = docs => from doc in docs
						select new
						 {
							doc.Title,
							_ = SpatialGenerate(fieldName: "Location", shapeWKT: doc.Location,
									strategy: SpatialSearchStrategy.GeohashPrefixTree, maxTreeLevel: 12)
						 };
		}
	}

	public IEnumerable<Event> GetEvents()
	{
		IEnumerable<Event> events;
		using (var session = store.OpenSession())
		{
			events = session.Query<Event>()
				.Customize(x => x.RelatesToShape("Location",
								"Circle(32.454898 53.234012 d=6.000000)", SpatialRelation.Within))
				.ToList();

		}
		return events;
	}

This is the unbound version of the API, and you can do just about anything with it. A few notes about this new API:

  1. The SpatialGenerate() method in the index definition is expecting a WKT formatted string. It can be any shape you want, but it has to be a legal shape string.
  2. Specifying a spatial strategy is done when defining the index. Changing a strategy will trigger re-indexing.
  3. The strategy and maxTreeLevel parameters are completely optional. Only use them if you know what you are doing, otherwise, stick to the defaults.
  4. You can provide ANY shape while querying, and an expected relation to it. More details on shape relations below.
  5. The results will be sorted by distance, unless otherwise requested.
  6. You can store several shapes in one document, and specify which shape it is you want to query on, using the fieldName argument in both the index definition and the query. However, at this point you can execute a query only against one spatial field at a time (but as many non-spatial fields as you want).

Obviously, one of the benefits of this new implementation is the ability to index any shape, and to issue a query with any shape against them. Circles, points, squares, polygons - RavenDB doesn't care anymore.

There are 3 types of shape relationships that are supported with this new implementation:

  1. Intersects - querying for a shape which intersects a shape stored in a document within RavenDB will find those shapes which intersect with the given shape. Intersection occurs when the two shapes have at least one shared grid hash. Because of current limitations of the algorithm, very large indexed shapes are not deemed to intersect with very small query shapes. However, smaller indexed shapes will intersect with larger query shapes.
  2. Disjoint - Finds those indexed shapes which are disjoint to the query shape. This means that the indexed shapes and query shape must have no shared grid hashes.
  3. Within / Contains - Finds those indexed shapes which are fully contained within the query shape. Unlike intersects, this means that all of the indexed shape must be present in the query shape. Any shapes which have additional area outside of the query shape are excluded.
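
In terms of grid hashes, the three relations above boil down to simple set operations. The sketch below models each shape as a flat set of cell labels; that is a simplification made here, since real prefix trees mix cells of different precision levels, and it is not the Lucene.Net spatial code.

```java
import java.util.Collections;
import java.util.Set;

final class GridRelationsSketch {
    // Intersects: the two shapes share at least one grid cell.
    static boolean intersects(Set<String> indexed, Set<String> query) {
        return !Collections.disjoint(indexed, query);
    }

    // Disjoint: the two shapes share no grid cell at all.
    static boolean disjoint(Set<String> indexed, Set<String> query) {
        return Collections.disjoint(indexed, query);
    }

    // Within: every cell of the indexed shape is also a cell of the query shape.
    static boolean within(Set<String> indexed, Set<String> query) {
        return !indexed.isEmpty() && query.containsAll(indexed);
    }
}
```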

Limitations and gotchas:

  • Distances with this new implementation are in kilometers, while the old implementation was using miles. Since this is what the internal implementation uses, and it is hardly exposed to the end user, we kept using the metric system. It is quite easy to convert this back to miles, and if there will be demand we might introduce a configuration option on the server side to do that.
  • Handling of polygons which cross the dateline isn't supported at this stage.
  • Multi-polygon support is lacking.
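
Regarding the distance units in the first limitation: until a server-side option exists, the conversion is easy to do on the client. A minimal helper, using the exact definition of the international mile (1.609344 km), could look like this; the class and method names are made up for the example.

```java
final class DistanceUnits {
    // One international mile is exactly 1.609344 kilometers.
    static final double KM_PER_MILE = 1.609344;

    static double milesToKm(double miles) {
        return miles * KM_PER_MILE;
    }

    static double kmToMiles(double km) {
        return km / KM_PER_MILE;
    }
}
```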

This new feature is really neat, and opens up great new opportunities with its simplicity and ease of use. It is available to us thanks to the spatial4j project, and powered by Lucene.Net, Spatial4n and NetTopologySuite.

21 Aug 2012

Leaving Hibernating Rhinos

After working for a while as a core developer for RavenDB, it is time for me to move on. Starting September, working on, supporting and training on RavenDB at Hibernating Rhinos will no longer be my day job.

I love RavenDB. It is a great product, and I'm sure it will get very far. The design decisions behind it make it real art. The way it breaks old bad habits we developers have, and the design it forces us to use, make you build better apps overall. Finally a database that actually helps you do your job without any real compromise.

However, from now on, RavenDB for me will be an open-source project which I'm involved in. As time permits, I will continue to hang out in the mailing list and provide support, perhaps even adding features or fixing bugs from time to time.

I will also continue providing training and consulting, on-site and remote. My next scheduled trip is to London during the first 2 weeks of September. I still have a few open slots for UK-based companies, so feel free to shoot me an e-mail. I will also deliver a free talk on RavenDB at SkillsMatter on September 12th while there, and will be happy to take any RavenDB related questions after it or over a beer (good beer and conversation always buy me off...).

My next gig is with Buzzilla, an Israel-based company which builds software that can track, monitor and analyze on-line conversation in social networks and the general Internet. A lot of cutting edge stuff is going on there, involving Machine Learning, NLP and search engines, so that's pretty exciting. I'm going to be leading a development team building a new search engine platform, tackling problems like distributed search and multi-lingual content, and working with BigData, which is always fun. Great times ahead!
