Code972 Coding from the back of a camel

6 Jul 2012

Hacking with RavenDB’s multi-maps

A couple of months ago I blogged about Orev, the OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox that make it possible to measure and compare the relevance of different full-text search methods.

In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.

Orev was built using RavenDB as its backing store, and in this post I'm going to show a nice approach we used to facilitate the judging process.

The Model

To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.

A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for several reasons: they are different transactional units (if I fix a typo in a document, the Corpus as a whole doesn't really change), and we want to retrieve one single document at a time when judging. Also, embedding all documents within a parent Corpus document would bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.

And this is how they look:

	public class Corpus
	{
		public string Id { get; set; }

		[Required]
		public string Name { get; set; }

		[Required]
		public string Description { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example
	}

	public class CorpusDocument
	{
		public string Id { get; set; }

		public string CorpusId { get; set; }
		public string Title { get; set; }
		public string Content { get; set; }
		public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
	}

	public class Topic
	{
		public string Id { get; set; }

		[Required]
		public string Title { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Description { get; set; }

		[Required]
		[DataType(DataType.MultilineText)]
		public string Narrator { get; set; }

		[Required]
		[StringLength(5, MinimumLength = 5)]
		public string Language { get; set; } // a language identifier string, en-US for example

		/// <summary>
		/// Id of user submitting this topic
		/// </summary>
		public string UserId { get; set; }
	}

	public class Judgment
	{
		public enum Verdict
		{
			Relevant,
			NotRelevant,
			Skip,
		};

		[Required]
		public string CorpusId { get; set; }

		[Required]
		public string DocumentId { get; set; }

		[Required]
		public string TopicId { get; set; }

		[Required]
		public string UserId { get; set; }

		[Required]
		public Verdict UserJudgement { get; set; }
	}

The Problem

We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.

Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.

The Solution

At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?

And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look at all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...

So this is where RavenDB's multi-maps come in. I select all Judgments, each with its Topic ID wrapped in an array, and all CorpusDocuments, each with an array holding a single empty string. This results in one big set of rows, each containing the CorpusDocument identifier (the document ID plus the Corpus ID) and one Topic ID for which a Judgment exists. The reason I'm selecting Topics as an array at this stage is to match the format the Reduce step produces; RavenDB requires all Map and Reduce functions to produce output of the same shape.

It is important to note ALL corpus documents will be listed, but there may be corpus documents with no topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.

The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows by the CorpusDocument identifier (which includes the Corpus ID), we get a smaller set of rows, where each CorpusDocument is represented only once, along with all the Topic IDs there are Judgments for. So, if we previously had a lot of rows, each with one CorpusDocument identifier and one Topic identifier, we have now consolidated all the data we have for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.
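To make the two steps concrete, here is a minimal in-memory sketch using plain LINQ to Objects rather than RavenDB. The `Row` and `MultiMapSketch` names and the helper methods are made up for illustration; they only mimic the shape of the rows described above and are not part of Orev or of RavenDB's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one index row; not a RavenDB type.
public sealed class Row
{
	public string DocumentId { get; set; }
	public string CorpusId { get; set; }
	public string[] Topics { get; set; }
}

public static class MultiMapSketch
{
	// Map over CorpusDocuments: one row per document, with an empty string as its only Topic ID.
	public static Row FromDocument(string documentId, string corpusId)
	{
		return new Row { DocumentId = documentId, CorpusId = corpusId, Topics = new[] { string.Empty } };
	}

	// Map over Judgments: one row per judgment, carrying the Topic ID it was made for.
	public static Row FromJudgment(string documentId, string corpusId, string topicId)
	{
		return new Row { DocumentId = documentId, CorpusId = corpusId, Topics = new[] { topicId } };
	}

	// Reduce: group by document + corpus and merge the Topic IDs, dropping duplicates.
	public static List<Row> Reduce(IEnumerable<Row> rows)
	{
		return rows
			.GroupBy(r => new { r.DocumentId, r.CorpusId })
			.Select(g => new Row
			{
				DocumentId = g.Key.DocumentId,
				CorpusId = g.Key.CorpusId,
				Topics = g.SelectMany(r => r.Topics).Distinct().ToArray(),
			})
			.ToList();
	}
}
```

Feeding this two documents and a few judgments on only one of them yields two rows: the judged document with the empty string plus its Topic IDs, and the unjudged one with just the empty string.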

Hence, we write this index:

	public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
	{
		public class ReduceResult
		{
			public string DocumentId { get; set; }
			public string CorpusId { get; set; }
			public string[] Topics { get; set; }
		}

		public CorpusDocuments_ByNextUnrated()
		{
			// Every CorpusDocument gets a row, with an empty string standing in for "no topic"
			AddMap<CorpusDocument>(docs => from corpusDoc in docs
				select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] { string.Empty } });

			// Every Judgment gets a row carrying the Topic ID it was made for
			AddMap<Judgment>(judgments => from j in judgments
				select new { DocumentId = j.DocumentId, CorpusId = j.CorpusId, Topics = new[] { j.TopicId } });

			// One row per CorpusDocument, with all of its Topic IDs merged
			Reduce = results => from result in results
				group result by new { result.DocumentId, result.CorpusId }
				into g
				select new
				{
					DocumentId = g.Key.DocumentId,
					CorpusId = g.Key.CorpusId,
					Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
				};

			// Return the actual CorpusDocument instead of the index row
			TransformResults = (db, results) => from result in results
				let doc = db.Load<CorpusDocument>(result.DocumentId)
				select doc;
		}
	}

Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of topics with judgments for each CorpusDocument, where all CorpusDocuments exist, even if they were never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:

			var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
				.Where("Topics:*") // match all docs
				.AndAlso()
				.WhereEquals("CorpusId", corpusId)
				.AndAlso()
				.Not
				.WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
				.RandomOrdering()
				.FirstOrDefault();

This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1

This query will first match all index documents from the given corpus, and then will remove all CorpusDocuments which have the given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made on it for that Topic; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.
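In plain LINQ terms, the logic of that query amounts to the filter below. This is only an in-memory sketch of the semantics: `IndexEntry` and `NextUnratedSketch` are hypothetical names, treating `Topics:*` as "has at least one Topic value" is my reading of the query, and the random ordering is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one reduced index entry; not a RavenDB type.
public sealed class IndexEntry
{
	public string DocumentId { get; set; }
	public string CorpusId { get; set; }
	public string[] Topics { get; set; }
}

public static class NextUnratedSketch
{
	// Returns the IDs of documents in the corpus that have no judgment for the topic.
	public static List<string> Candidates(IEnumerable<IndexEntry> entries, string corpusId, string topicId)
	{
		return entries
			.Where(e => e.Topics.Length > 0)          // Topics:*  (every entry has at least the empty-string topic)
			.Where(e => e.CorpusId == corpusId)       // AND CorpusId:<corpusId>
			.Where(e => !e.Topics.Contains(topicId))  // AND -Topics:<topicId>
			.Select(e => e.DocumentId)
			.ToList();
	}
}
```

A document that was never judged still passes the first filter thanks to its empty-string topic, which is why the map over CorpusDocuments emits that placeholder.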

And just to spice things up a bit, I threw in RandomOrdering().

21 Sep 2011

Orev: The Apache OpenRelevance Viewer

It has been quite some time since I said I'd be working on this, as I got caught up in other pressing matters and had to drop it for a while. But it is all for the best. The technology I used for this new version is just a perfect fit for this application, and it wasn't available then. I'll be addressing the technical aspects later in this post and also in some follow-up posts.

My first interest in the OpenRelevance project, and one of the main reasons I created Orev, was the HebMorph project. Using Orev, I'm hoping to be able to create an environment where tools for Hebrew IR can be tested and compared, to produce the ultimate Hebrew analyzer, for Lucene and other libraries as well.

Before anything else, the complete source code is available at https://github.com/synhershko/Orev.

I have a hosted version too which I will publish a link to soon, once I get some things sorted out and some feedback from other people who were involved in this project.

What is this?

The OpenRelevance project is an Apache project, aimed at making materials for doing relevance testing for information retrieval (IR), Machine Learning and Natural Language Processing (NLP). Think TREC, but open-source.

These materials require a lot of managing work and many human hours to be put into collecting corpora and topics, and then judging them. Without going into too many details here about the actual process, it essentially means crowd-sourcing a lot of work, and that is assuming the OpenRelevance project had the proper tools to offer the people recruited for the work.

With no such tool available, the Viewer - Orev - is meant to be exactly that, minimizing the overhead required from both the project managers and the people who will be doing the actual work. By providing nice and easy facilities to add new Topics and Corpora, and to feed documents into a corpus, it will make it very easy to manage the surrounding infrastructure. And with a nice web UI for judging documents, the work of the recruits is going to be very easy to grok.

More technical details

Orev is multi-lingual from the ground up, and is heavily user-based. Every user can view available topics and corpora, and make judgments based on the languages he speaks.

Managers can add new topics, create new corpora and feed those with documents. Documents can be added to a corpus, or updated, at a later time, too.

We will probably also add the ability for users to submit topics, and other similar features.

Even more technical details

When I started to work on this I was using NHibernate and spent some time on designing a DB schema, fighting with ASP.NET MVC and all that. Now that MVC 3 is out, and RavenDB is rocking worlds, it was a matter of a few hours to get this all started again from scratch. Using a schema-less DB really made this possible to do in a minimum number of hours, excluding some dilemmas and frustrations which I will be blogging about soon.

In the original design I intended to load corpus documents from external sources, or to store them on the file-system. Since it is now using RavenDB, which is a document-based database, storing the documents in the DB itself actually makes sense. This is also how we can later offer updating a corpus with new documents, or patching old documents.

What's next

We need to run a lot of tests, get a lot of feedback and improve accordingly. The first step is obviously gathering content and raising interest, so if you find this post / project interesting - please spread the word.

Orev is currently using the default ASP.NET MVC theme. If there's any HTML5/CSS designer and magic worker who can take up the task to recreate it to be more inviting and easier to work with - it is something we can definitely use.

I have enabled the github bug tracker in the Orev source repository. Please use it for reporting bugs or asking for features.

When the dust settles and actual judging commences on a regular basis, we will start working on code to output stats and statistical computations, in preparation for the original cause of the OpenRelevance project - to measure performance of IR software (+ NLP + ML, of course), and to be able to produce bleeding edge analyzers for various languages.