Code972 Coding from the back of a camel

3 Aug 2012

xUnit.net, PropertyDataAttribute and derived classes

Apparently do not work well together...

Consider the following:

	public class Foo
	{
		public int AddsNumbers(int x, int y) { return x + y; }
	}

	public class FooTest
	{
		public static IEnumerable<object[]> Expectations
		{
			get
			{
				yield return new object[] { 1, 2, 3 };
			}
		}

		[Theory]
		[PropertyData("Expectations")]
		public void Adds_ReturnsExpectedValues(int x, int y, int expected)
		{
			Assert.Equal(expected, new Foo().AddsNumbers(x, y));
		}
	}

All will work well: the method Adds_ReturnsExpectedValues will be called with the parameters taken from the Expectations property, as expected. But what if we wanted to make FooTest an abstract base class and keep the basic tests there, while declaring the Expectations property in a derived class?

This scenario makes sense when you have a good class hierarchy; you can then have many basic tests done on the general level at the abstract class, and specialized tests on the derived classes, while still controlling the parameters passed to them from the derived classes.

But when you try that, you discover it doesn't work with xUnit.net. xUnit will only look for the PropertyData on the declaring type, i.e. the abstract class, and completely ignore the derived classes, even though they are the ones that actually triggered the test. You immediately start thinking about writing a method to run those tests manually, but that's quite a headache when you have repeating tests, data properties, etc.

Luckily, this is an easy fix. I just took the source for PropertyDataAttribute from the xUnit.net project and changed it to the following (only the changed method is shown). Now it all works, giving priority to the type that triggered the test:

		public override IEnumerable<object[]> GetData(MethodInfo methodUnderTest, Type[] parameterTypes)
		{
			Type type = PropertyType ?? methodUnderTest.ReflectedType;
			PropertyInfo propInfo = type.GetProperty(propertyName, BindingFlags.Public | BindingFlags.Static | BindingFlags.FlattenHierarchy);
			if (propInfo == null)
			{
				string typeName = type.FullName;
				if (methodUnderTest.DeclaringType != null)
				{
					propInfo = methodUnderTest.DeclaringType.GetProperty(propertyName,
					                                                     BindingFlags.Public | BindingFlags.Static |
					                                                     BindingFlags.FlattenHierarchy);
					typeName = "neither " + typeName + " nor " + methodUnderTest.DeclaringType.FullName;
				}

				if (propInfo == null)
					throw new ArgumentException(string.Format("Could not find public static property {0} on {1}", propertyName,
					                                          typeName));
			}

			object obj = propInfo.GetValue(null, null);
			if (obj == null)
				return null;

			IEnumerable<object[]> dataItems = obj as IEnumerable<object[]>;
			if (dataItems == null)
				throw new ArgumentException(string.Format("Property {0} on {1} did not return IEnumerable<object[]>", propertyName, type.FullName));

			return dataItems;
		}
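
To see why the lookup has to start from ReflectedType, here is a minimal, self-contained sketch using plain reflection and no xUnit.net at all. The class and member names (BaseTests, DerivedTests, ReflectedTypeDemo) are made up for illustration, and it assumes the runner obtains the test method through the derived type, which is the case the change above relies on.

```csharp
using System;
using System.Reflection;

// Hypothetical stand-in for the abstract test class holding the shared tests.
public abstract class BaseTests
{
    public void SharedTest() { }
}

// Hypothetical stand-in for the derived test class that supplies the data.
public class DerivedTests : BaseTests
{
    public static string Expectations { get { return "from derived"; } }
}

public static class ReflectedTypeDemo
{
    private const BindingFlags Flags =
        BindingFlags.Public | BindingFlags.Static | BindingFlags.FlattenHierarchy;

    // The inherited method, looked up through the derived type.
    public static MethodInfo SharedTestMethod()
    {
        return typeof(DerivedTests).GetMethod("SharedTest");
    }

    // True when a public static Expectations property is visible from the given type.
    public static bool FoundOn(Type type)
    {
        return type.GetProperty("Expectations", Flags) != null;
    }
}
```

With `var m = ReflectedTypeDemo.SharedTestMethod();`, `m.DeclaringType` is BaseTests and `m.ReflectedType` is DerivedTests, so `FoundOn(m.DeclaringType)` is false while `FoundOn(m.ReflectedType)` is true. FlattenHierarchy only walks up the hierarchy, never down, which is why starting from the declaring type misses the property.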
6 Jul 2012

Hacking with RavenDB’s multi-maps

A couple of months ago I blogged about Orev - OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox that make it possible to compare the relevance of different full-text search methods.

In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.

Orev was built using RavenDB as its back-store, and in this post I'm going to show a nice approach we used to facilitate the judging process.

The Model

To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.

A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for many reasons: they are different transactional units (if I update a typo in a document, the entire Corpus doesn't really change), and we want to retrieve one single document at a time when judging. Also, containing all documents within a parent Corpus document will bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.

And this is how they look:

	public class Corpus
	{
		public string Id { get; set; }

		[Required]
		public string Name { get; set; }

		[Required]
		public string Description { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example
	}

	public class CorpusDocument
	{
		public string Id { get; set; }

		public string CorpusId { get; set; }
		public string Title { get; set; }
		public string Content { get; set; }
		public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
	}

	public class Topic
	{
		public string Id { get; set; }

		[Required]
		public string Title { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Description { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Narrator { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example

		/// <summary>
		/// Id of user submitting this topic
		/// </summary>
		public string UserId { get; set; }
	}

	public class Judgment
	{
		public enum Verdict
		{
			Relevant,
			NotRelevant,
			Skip,
		};

		[Required]
		public string CorpusId { get; set; }

		[Required]
		public string DocumentId { get; set; }

		[Required]
		public string TopicId { get; set; }

		[Required]
		public string UserId { get; set; }

		[Required]
		public Verdict UserJudgement { get; set; }
	}

The Problem

We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.

Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.

The Solution

At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?

And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look at all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...

So this is where RavenDB's multi-maps come in. I select all Judgments, each with its Topic ID inside an array, and all CorpusDocuments, each with an array holding a single empty string. This results in one big set of rows, each row containing the CorpusDocument ID (which is a document ID + the corpus ID) and one Topic ID for which a Judgment exists. The reason I'm selecting Topics as an array at this stage is to comply with the format in which the Reduce step will produce its results; RavenDB requires all Map and Reduce functions to have the same output type.

It is important to note ALL corpus documents will be listed, but there may be corpus documents with no topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.

The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows based on the CorpusDocument ID (which includes the Corpus ID), we get a smaller set of rows, where each CorpusDocument is represented only once, along with all the Topic IDs there are Judgments for. So, if we previously had a lot of rows, each with one CorpusDocument identifier and one Topic identifier, we have now consolidated all the data we have for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.

Hence, we write this index:

	public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
	{
		public class ReduceResult
		{
			public string DocumentId { get; set; }
			public string CorpusId { get; set; }
			public string[] Topics { get; set; }
		}

		public CorpusDocuments_ByNextUnrated()
		{
			AddMap<CorpusDocument>(docs => from corpusDoc in docs
			                               select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] { string.Empty } });

			AddMap<Judgment>(judgments => from j in judgments
			                              select new { DocumentId = j.DocumentId, CorpusId = j.CorpusId, Topics = new[] { j.TopicId } });

			Reduce = results => from result in results
			                    group result by new { result.DocumentId, result.CorpusId }
			                    into g
			                    select new
			                    {
			                    	DocumentId = g.Key.DocumentId,
			                    	CorpusId = g.Key.CorpusId,
			                    	Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
			                    };

			TransformResults = (db, results) => from result in results
			                                    let doc = db.Load<CorpusDocument>(result.DocumentId)
			                                    select doc;
		}
	}

Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of topics with judgments for each CorpusDocument, where all CorpusDocuments exist, even if they were never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:

			var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
				.Where("Topics:*") // match all docs
				.AndAlso()
				.WhereEquals("CorpusId", corpusId)
				.AndAlso()
				.Not
				.WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
				.RandomOrdering()
				.FirstOrDefault();

This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1

This query will first match all index documents from the given corpus, and then remove all CorpusDocuments which have the given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made on it; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.

And just to spice things up a bit, I threw in RandomOrdering().
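
If you want to convince yourself of the logic without a RavenDB server around, the same idea can be sketched with plain in-memory LINQ. This is only an illustration of the map / reduce / exclude steps, not how RavenDB executes the index; the Row class, the NextUnratedSketch name and the sample IDs are all made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NextUnratedSketch
{
    // Same shape as the index's ReduceResult; hypothetical name.
    public sealed class Row
    {
        public string DocumentId { get; set; }
        public string CorpusId { get; set; }
        public string[] Topics { get; set; }
    }

    // What the two maps would emit for two documents and two judgments (made-up data).
    public static List<Row> SampleMapOutput()
    {
        return new List<Row>
        {
            // rows from the CorpusDocument map: every document, with a single empty topic
            new Row { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { string.Empty } },
            new Row { DocumentId = "docs/2", CorpusId = "corpus/1", Topics = new[] { string.Empty } },
            // rows from the Judgment map: one row per judgment
            new Row { DocumentId = "docs/1", CorpusId = "corpus/1", Topics = new[] { "topics/1" } },
            new Row { DocumentId = "docs/2", CorpusId = "corpus/1", Topics = new[] { "topics/2" } },
        };
    }

    // Mirrors the Reduce step: one row per (DocumentId, CorpusId), topics merged.
    public static List<Row> Reduce(IEnumerable<Row> mapped)
    {
        return mapped
            .GroupBy(r => new { r.DocumentId, r.CorpusId })
            .Select(g => new Row
            {
                DocumentId = g.Key.DocumentId,
                CorpusId = g.Key.CorpusId,
                Topics = g.SelectMany(r => r.Topics).Distinct().ToArray()
            })
            .ToList();
    }

    // In-memory counterpart of: CorpusId:<corpusId> AND -Topics:<topicId>
    public static List<string> Unjudged(IEnumerable<Row> reduced, string corpusId, string topicId)
    {
        return reduced
            .Where(r => r.CorpusId == corpusId && !r.Topics.Contains(topicId))
            .Select(r => r.DocumentId)
            .ToList();
    }
}
```

Calling `NextUnratedSketch.Unjudged(NextUnratedSketch.Reduce(NextUnratedSketch.SampleMapOutput()), "corpus/1", "topics/1")` leaves only docs/2, since docs/1 already carries topics/1. A document that was never judged at all still survives the filter, because its only topic is the empty string.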

29 Jun 2012

The philosophy behind NAppUpdate

About 2 years ago I was building a .NET desktop application, and needed an easy way to allow it to auto-update itself. I looked for libraries that do that, and all I could see was either complicated, commercial, or geared towards a very particular (usually common) use case. I didn't want anything of the sort, so this is how NAppUpdate was born.

NAppUpdate is a very lightweight library that takes no dependencies and doesn't use anything fancy (it actually runs on .NET 2.0). It was designed to be able to perform any update process you can think of, and do all the heavy lifting for you. Using very few simple API calls you can get your application to self-update from the web, local network, BitTorrent or whatever, and along the way perform DB schema updates, registry changes, additional installations and what not. Some functionality is supported out-of-the-box, and whatever is not can be very easily added. With very few lines of code you can make it behave any way you want.

NAppUpdate is being used quite widely, but has nearly no documentation. I believe simple software doesn't really need docs, although some sort of explanation of how it works and what it is capable of doing is still important to have. This is what this post is about, and while at it I will discuss some of the key principles behind the design of the library. For future readers, please note all code samples work with version 0.2.

Getting started

To get started, you will need to reference the NAppUpdate DLL (only one DLL) from your project. Grab the latest release binaries or compile from source.

NAppUpdate is implemented as a singleton, and the public facing API is called UpdateManager. Once you have it referenced, all the operations will be available to you through UpdateManager.Instance. You don't need to initialize anything, it will just work.

NAppUpdate only needs to know where to get the data from, and that's about the only thing you will need to do to get started. This is done by simply providing NAppUpdate with an IUpdateSource implementation.

Bundled with NAppUpdate currently are basic implementations for getting data from the web using SimpleWebSource (HTTP, FTP, proxies and all that stuff), from a UNC source, and an in-memory source. Common usage would look something like this, and you want to run it when your app starts:

UpdateManager.Instance.UpdateSource = new NAppUpdate.Framework.Sources.SimpleWebSource("http://mydomain.com/feed.xml"); // provided is the URL for the updates feed
UpdateManager.Instance.ReinstateIfRestarted(); // required to be able to restore state after app restart

Checking for updates

It's as simple as it gets:

                if (UpdateManager.Instance.CheckForUpdates())
                {
                    DialogResult dr = MessageBox.Show(
                        string.Format("Updates are available to your software ({0} total). Do you want to download and prepare them now? You can always do this at a later time.",
                        UpdateManager.Instance.UpdatesAvailable),
                        "Software updates available",
                         MessageBoxButtons.YesNo);

                    if (dr == DialogResult.Yes)
                    {
                        UpdateManager.Instance.PrepareUpdatesAsync(OnPrepareUpdatesCompleted);
                    }
                }
                else
                {
                    MessageBox.Show("Your software is up to date");
                }

You can do this in a blocking manner as shown above, or asynchronously. Since this will usually involve network traffic, it is recommended to have this running on a non-UI thread. If you don't have a spare thread handy, just call CheckForUpdatesAsync; it will do all the heavy lifting for you.

The common practice is to call CheckForUpdateAsync with a callback. In the callback you can handle the news of new updates as you see fit.

CheckForUpdates will retrieve the updates feed using the IUpdateSource implementation you provided originally (or you can pass it a new one), and will parse it to produce a list of update tasks. The feed is parsed using an IUpdateFeedReader implementation. You can roll your own, or use NauXml.

NauXml: Tasks and Conditions

Internally, NAppUpdate executes update tasks (concrete classes implementing IUpdateTask), and allows you to define conditions on them. Tasks without any conditions, or with trivial ones, will always execute.

Conditions are simply concrete classes implementing the interface IUpdateCondition. There is quite a handful of them built-in, like FileVersionCondition, FileChecksumCondition, OSCondition and many more. It is quite trivial to add any other condition as well.

To reflect that structure in the best way possible, NAppUpdate defines an XML schema we call NauXml. It is quite trivial to understand what's going on, and to write one yourself:

<?xml version="1.0" encoding="utf-8"?>
<Feed>
  <Tasks>
    <FileUpdateTask hotswap="true" updateTo="http://SomeSite.com/Files/NewVersion.dll" localPath="CurrentVersion.dll">
      <Description>Fixes a bug where versions should be odd numbers.</Description>
      <Conditions>
        <FileChecksumCondition checksumType="sha256" checksum="6B00EF281C30E6F2004B9C062345DF9ADB3C513710515EDD96F15483CA33D2E0" />
        <FileDateCondition type="or" what="is" timestamp="20091010T000000" />
      </Conditions>
    </FileUpdateTask>
  </Tasks>
</Feed>

NAppUpdate has an appropriate feed reader for the NauXml format built-in, obviously, and it is the default FeedReader implementation used. You can use any other format by handing NAU another IUpdateFeedReader implementation.

Preparing updates

So, you were notified of new updates, now what?

You could either notify the user of them and start preparing them if the user wishes to proceed (like in the example shown above), or prepare them silently and only notify the user when everything is ready to roll. It's completely up to you, just like the way you would be notifying them about the updates.

You can track the progress of the update preparation by subscribing to the UpdateManager.Instance.ReportProgress event. It will notify you of the general progress, and which task is currently preparing itself. There is a code sample showing exactly that in the GitHub repository (available also within the download).

The preparation step performs all the lengthy work required for an update, without changing anything in your system. As such, it is completely safe to abort it, and no rolling back is required.

Applying updates

Once everything is prepared, all you have to do (probably after getting the user's consent) is call UpdateManager.Instance.ApplyUpdates(bool restartApplication). This will apply the updates.

Some update tasks might require a cold-update, meaning they cannot complete while the application is running. This is either by request in the feed, or the task tried updating while the application is working, failed, and fell back to requesting a cold update.

If no cold-updates are required, the update process will finish here. If there are any cold updates pending, the application will restart itself and apply them while it is shut down. You can ask NAppUpdate to bring the app back up after performing the update.

You can also defer the update process until the user exits the application. This ensures the process doesn't get in their way, and again, it is simply a matter of calling ApplyUpdates(false) on close.

ApplyUpdates has to be called from the main UI thread if it may involve shutting down the application; this ensures the right order of operations and that everything shuts down correctly.

Rolling back on failure

While preparing updates, or just before applying them, a rollback plan is prepared. In FileUpdateTask for example, the original file, if it exists, is copied to a backup location. Should anything go wrong, NAppUpdate will call the IUpdateTask Rollback() method of the failed task, and that will restore everything to normal.
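
The prepare / apply / rollback split can be sketched with nothing but System.IO. To be clear, this is not NAppUpdate's code and not its IUpdateTask interface; FileSwapSketch and its members are hypothetical names, and the ".bak" backup location is an assumption made for the example.

```csharp
using System;
using System.IO;

// Hypothetical illustration of a file update with a rollback plan.
public sealed class FileSwapSketch
{
    private readonly string targetPath;
    private readonly string newContent;
    private string stagedPath;
    private string backupPath;
    private bool applied;

    public FileSwapSketch(string targetPath, string newContent)
    {
        this.targetPath = targetPath;
        this.newContent = newContent;
    }

    // Prepare: do the slow work (here, staging the new file) without touching the target.
    public void Prepare()
    {
        stagedPath = Path.GetTempFileName();
        File.WriteAllText(stagedPath, newContent);
    }

    // Apply: back up the original if it exists, then swap the staged file in.
    public void Apply()
    {
        if (File.Exists(targetPath))
        {
            backupPath = targetPath + ".bak"; // assumed backup location
            File.Copy(targetPath, backupPath, true);
        }
        applied = true;
        File.Copy(stagedPath, targetPath, true);
    }

    // Rollback: restore the backup, or delete the file if there was no original.
    public void Rollback()
    {
        if (!applied) return; // nothing was changed, nothing to undo
        if (backupPath != null)
            File.Copy(backupPath, targetPath, true);
        else if (File.Exists(targetPath))
            File.Delete(targetPath);
    }
}
```

Aborting after Prepare() needs no cleanup of the target, since only a temp file was written; Rollback() only has work to do once Apply() has started changing things.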

Mind the language

Take note of the language I've been using while describing the update process. You are not necessarily "downloading" anything, nor "replacing files". You simply "Prepare" and "Apply", potentially in a cold manner.

This is a fundamental concept of NAppUpdate, and what makes it so strong. All common scenarios for performing application updates are already supported by the built-in functionality; but it is all done using generic concepts, so any update process will fit.

Error handling

NAppUpdate tries hard to stick to the KISS principle, but being able to provide important indications on what's going on is very important. So when designing the API we went with a very simplistic approach, that allows for easy operation and well-defined behavior, while not compromising the ability to understand what's going on.

Each method you call will return true if successful, and false if anything has gone wrong. If you were using the Async version of a method, the callback function will be handed that boolean value. When a NAU method returns false, you should check UpdateManager.Instance.LatestError to see what went wrong. For your convenience, common errors are available in NAppUpdate.Framework.Common.Errors as consts.

NAppUpdate has a well-defined state at any given point, and it is available via UpdateManager.State. Since it doesn't make sense to CheckForUpdates() if you already have pending update tasks, NAU will disallow that. You should check the current state yourself before calling any NAU method if the update process is not managed automatically.

Feed generator

A long-requested feature is an automatic feed generator, and there have been a few efforts to come up with a working one. Recently NAppUpdate received a contribution of a very nice feed generator; a screenshot of it is shown below. It is a very useful tool, but it barely scratches the surface of what such a generator could look like, thanks to the flexible structure of NAppUpdate.

In a future post I'll follow up on this and discuss how I envisioned such a generator tool. Since I hate doing UI, I will leave it as a challenge for whoever is interested...

Links

NAppUpdate on github - https://github.com/synhershko/NAppUpdate

Official mailing list - https://groups.google.com/group/nappupdate

More NAppUpdate content - http://www.code972.com/blog/tag/nappupdate/

24 Jun 2012

Google is not so smart after all

I mean, this one is trivial:


I'd rather have it show me search results than try to give me the answer I was looking for, unless it is very good at it. Like, Wolfram-Alpha good.

21 Jun 2012

Single point of failure

September 1st, 1983. Korean Airlines flight 007 from New York City to Seoul disappeared a couple of hours after take-off. Only later was it discovered that the plane deviated from its original route; instead of flying through air corridor R-20, it entered Soviet airspace and was shot down by a Soviet interceptor. All 269 people on board were killed.

During an investigation conducted by the National Transportation Safety Board (NTSB), it became clear the plane had been cruising far north of where it should have been. Instead of flying above international waters, the plane somehow entered Soviet airspace, leading them to shoot it down, thinking it was a plane on a spying mission. How did the plane deviate that much from its assigned route? The NTSB came up with two possible explanations, both pointing to human error.

The first option was typing the aerial waypoints incorrectly. These are latitude / longitude pairs the co-pilot enters and the captain validates, and they form the flight's route. Mistyping one digit may take the plane way off its planned route, possibly making it enter hostile territory. The NTSB also mentioned another possibility - not turning the coordinates-based auto-pilot (INS) on, and instead flying in the Magnetic Heading auto-pilot mode. The Magnetic Heading option is always on during take-off, so it would require the pilot to remember to change the auto-pilot mode. If he failed to do so, the INS system would not use the coordinates they typed to guide the plane, since it would be off.

The captain of KAL flight 007 had years of flying experience: 10+ years with KAL, and many years before that in the air force. Therefore, the NTSB deemed the second option "less likely". They thought it much more likely that someone typed a number incorrectly and did not bother to verify it than that a very experienced pilot forgot to flip a switch right after take-off. It is a switch you flip on every flight, after all.

Years later, after the Soviet Union fell apart and the investigation was able to conclude using the original black-box from the plane, the real reason for the deviation of the flight was discovered. It turns out the captain forgot to switch the INS system on, so the plane was cruising using the Magnetic Heading. Had he remembered to switch the INS system on at any point during the flight, he would have caught the error and redirected the plane to its assigned route, probably avoiding death.

In the software world we have a lot of slogans, methodologies and names for patterns. Single point of failure is not just a slogan. In this case, the system had many single points of failure, and it was only a matter of time before it would have fatal consequences. I'm pretty sure this is not the only time the pilot forgot to switch to INS mode; it is the only time (that I know of) it caused death. Of an entire 747.

The Single Point of Failure in this case is not a system crash, or a bottleneck. It is about assuming the operator will always remember to do the right thing at the right time. And that is wrong, even if your user has 10+ years of flawless experience. I'm consciously avoiding the discussion on the poor UX of the auto-pilot system, and this is why I left some details relating to it out. Yes, you can get away from this using some UX tricks, like checklists or blinking signs or whatever, but then in the best scenario you are just making it less likely to happen, which is not good enough.

If it is the common practice to always first have magnetic heading mode turned on, and then switch to something else (not necessarily INS), then having it as a dedicated mode is a wrong assumption. But here I'm talking UX again, so we'll stop here.

When designing any software, not to mention complex systems, don't ever allow for a single point of failure, and don't ever assume it is only about preventing bottlenecks or crashes. In some systems you might save lives, but in most systems you'll just save yourself a lot of support calls.

You can read the full story, with all the details, on the Wikipedia page. National Geographic had an episode on it in the excellent "Air Crash Investigation" series, which you can watch here. The image above is from that show.

15 Jun 2012

WMS: Rethinking our need for CMS

Traditionally, Content Management Systems are about Content. Whenever Data that is not simply a content page was going to be persisted in a CMS, you'd somehow fit it into a Content entity, keeping the thought process always at "how would it look to the end user", never treating it as actual data.

Sure, there are page-centric CMSes like DNN, and CMSes which try to be data-centric. But when talking CMSes, you can either be too focused on solving your problem, in which case, unless it is actual Content Pages that you are working with, you will probably move away from CMSes anyway; or abstract too much in order to be one-size-fits-all, which always ends badly. There doesn't seem to be any middle ground.

This seems to be a problem rooted in how CMSes evolved, and in their strong ties to a relational database backend. CMSes were always about "Content", not about "Data", and the relational model doesn't make your life any easier when you have actual data you want to work with (not just "display").

WMS: Redefining CMS

In today's reality, we might choose to continue bending terms and force all those types of systems under the definition of a CMS, sure, but this will most probably end up as a multi-purpose data-to-ContentPage converter built on your MVC framework of choice, achieving nothing, really. Been there, done that.

Instead, I like to think of a WMS - Website Management System. I want this thing to help me deploy any website quickly and efficiently, allow me to manage my content AND DATA by providing admin screens, play nicely with my IDE, support multi-site nicely, and as a general rule not stand in my way. This is mainly because I don't like building websites, I like building software that might be served to the world as websites. I don't like being a server administrator, either.

A perfect WMS would have a gallery of modules I could easily integrate, so I could deploy a website, any website, with no hassle, getting full-blown admin screens for each module that are actually fit for managing the data used by the modules. No stretching or rounding corners.

And whenever I'd want to add a new custom functionality to my website, I could just fire up my IDE, implement some interfaces, test, play with the new website locally, deploy, set 2-3 settings, and be done with it.

An important thing to note about a WMS is that it does NOT manage data for you, and also the concept of Content is gone forever. A ContentPages module might exist, but it is going to treat all content-pages as data, the module's data. A WMS is not about persistence, and the only data access it does is for persisting configurations. It might manage data-access for the modules, but only to an extent of being a middleman for serving data requests, requests being created by the module itself. This is a design decision we have to make to make sure we do not pose limits to what we can do with modules; data has to be persisted independently of the WMS, but optionally orchestrated by it.

Spec'ing WMS

So how would one go about implementing such a thing?

First, dropping SQL and using a schema-less database, with a document database probably being the best fit here. Getting rid of a schema is the first real step towards creating real-world dynamic applications with good performance.

Then, moving from thinking in "Pages" and starting to think in terms of "Routes". Back in the days when we were displaying only content pages, it made sense to have a URL to a Page. With a schema-less model, we have to make our website URLs "schema-less" as well - and this is where routing comes in. Each module will define its own set of URL templates, and while configuring your website you will tell the WMS which Area, or URL prefix, this module is going to use. Simple modules like a ContentPages one will probably have one rule, and more complex ones like a products catalog or an eCommerce module will have all the flexibility they need.
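
To make the Area idea a bit more concrete, here is a rough sketch of the lookup a WMS could do before handing the rest of the URL to a module's own templates. Everything in it is hypothetical: AreaRouterSketch, Mount and Resolve are names invented for this example, and a real implementation would sit on top of the MVC framework's routing rather than a dictionary.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: maps a URL prefix (Area) to the module mounted on it.
public sealed class AreaRouterSketch
{
    private readonly Dictionary<string, string> areas =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Called while configuring the website: "this module lives under this prefix".
    public void Mount(string areaPrefix, string moduleName)
    {
        areas[areaPrefix.Trim('/')] = moduleName;
    }

    // Returns the module owning the URL, and the remaining path for the module's
    // own URL templates; returns null (and a null remainder) when no Area matches.
    public string Resolve(string url, out string remainder)
    {
        string path = url.Trim('/');
        int slash = path.IndexOf('/');
        string prefix = slash < 0 ? path : path.Substring(0, slash);

        string module;
        if (areas.TryGetValue(prefix, out module))
        {
            remainder = slash < 0 ? string.Empty : path.Substring(slash + 1);
            return module;
        }

        remainder = null;
        return null;
    }
}
```

After `router.Mount("shop", "ProductsCatalog")`, resolving "/shop/category/shoes" yields ProductsCatalog with "category/shoes" left over for the module to match against its own templates. The WMS only owns the prefix; what comes after it is entirely the module's business.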

And this brings me to the MVC pattern, which we probably are going to want to use. To use the MVC pattern correctly, each module will contain the Controllers, Models and Views it is going to use. A Controller must provide all the content for the View to consume, hence making all the data-access operations in the Controllers alone.

This is some high-level description of what I consider to be a fairly robust and flexible WMS, what in my opinion should replace what we know today as a CMS. I'll be writing a follow-up post with some discussion on other aspects which I consider important, namely Menus, Navigation and Hierarchies. But these are at some level just implementation details, which once you get the framework right, will need to fit in nicely.

22 May 2012

The future of geo-spatial searches with Lucene

Or: Introducing Spatial4n

The Lucene spatial contrib module has been a nice addition to Lucene, but for a while now too many bug reports have been piling up, and it got to a point where it was clear something was broken somewhere deep inside. Luckily, a bunch of good people started writing their own general purpose geo-spatial library in Java, and provided a Lucene module to interact with it to provide spatial search functionality. This project is called Spatial4j (formerly Lucene Spatial Playground), and it works great, solving all known issues with the previous implementation.

What's even better is that it was built from the ground up to support complex searches, like polygons and other custom shapes, as well as different search strategies. This is not just about a circle and a radius anymore. The guys that created it really dig geo-spatial searches, so it is probably going to get a lot better over time.

This library, as well as its accompanying Lucene module, is now part of Lucene, and should be available to all when Lucene 4.0 is released.

Since RavenDB uses the old spatial module, we were getting quite a few bug reports ourselves, without really being able to do anything about it. So when we heard about this project, it was clear that we should be using it. And since it is written in Java, well - luckily this isn't the first piece of Java code I've been porting...

The Spatial4n library - the .NET version of Spatial4j - is available here: https://github.com/synhershko/Spatial4n

The Lucene part of things, sync'd with Lucene.NET's trunk, can be found here: https://github.com/synhershko/lucene.net/tree/spatial2trunk . It will be there until those are merged upstream. There is also a branch with the new spatial module that is compatible with the 2.9.4 API - https://github.com/synhershko/lucene.net/tree/spatial.

We had to do some custom coding to get it to work with all the functionality we wanted, but it was all doable and so far this library looks very promising. All it needs now is a bit more attention.

3 May 2012

RavenDB: The 2012 US tour

We have been working hard setting up a lot of RavenDB events in the US and Europe for the upcoming months, a full list of which can be found here.

This summer I'll be visiting the US to hold some of those events. Here is a list of the workshops I'll be delivering and user group meetings I'll attend this summer:

I'm also available for consulting work while visiting in NY, Chicago and London.

17 Apr 2012

ProTip: Don’t push binaries to source control, use nuget wisely instead

If you use nuget, then you are probably familiar with the packages folder that takes forever to update in your source control repository. Whenever you or someone on your team updates a package, committing and pushing it takes forever, and then everyone updating their local copy has to wait as well for it to download again.

In most scenarios there is absolutely no need for going through that torture. Instead of pushing binaries to your source control repository, you can just get those packages dynamically when building. Here are the steps to make this happen in an existing solution - it is even easier to do for a new one:

1. Copy Nuget.exe (you can get it from their site) to a \Tools folder under your solution

2. Go to Project -> ProjectName properties -> Build Events and add this pre-build action:

$(ProjectDir)..\Tools\Nuget.exe install $(ProjectDir)packages.config -o $(ProjectDir)..\Packages

3. Save and close VS. Delete the packages folder and commit.

4. Add the packages folder to .gitignore. If you are not using git, start using it already.

5. Open the solution again, do a Clean Solution and build, and check that nothing breaks. It shouldn't. Now push the change and relax.

Note this method preserves the package version used, and when someone updates the version number all that is left to do is push a simple change to a text file and that's about it.

Slick.

30 Jan 2012

RavenDB London tour

This February I'll be visiting London, consulting on RavenDB and giving our course at Skills Matter. There is also a free session, in which I'll discuss the RavenDB indexing system and how to make the most out of it.

More info on our RavenDB London course (Feb 28-29): http://skillsmatter.com/course/open-source-dot-net/ayende-rahiens-ravendb-workshop

The "In The Brain" session (Feb 28th, 18:30): http://skillsmatter.com/event/open-source-dot-net/ravendb

I might have a free evening there, and I'll be happy to discuss RavenDB or any other dev topic over a beer. Just saying.
