August consulting / training opportunities
This August I'm flying around a bit and have some availability for on-site training and consulting in various places around the world.
If you are interested in improving your team's skills with RavenDB, Lucene or ElasticSearch, or need an extra set of eyes to help you solve a problem, I'll be happy to jump in for a day or two and help.
I'll be in the following cities / areas:
- Barcelona, Spain
- London, UK and surroundings
- New York City and surroundings
- Toronto, Canada
Contact me via e-mail (itamar at this domain) for more details.
NancyFx: Live editable views with RavenDB
When building a website with NancyFX, views are loaded from the file system by default, pretty much like with all MVC-based websites. While NancyFX also supports loading views embedded in assemblies as resources, both options require re-deploying actual files when something in a view needs updating. Even in fully CI-driven environments, that is still sort of a PITA.
Here is how to use RavenDB to override views in a live website without re-deploying anything. Thanks to NancyFX's modular and flexible design, all it takes is wrapping the default ViewLocationProvider: the wrapper reads all the views from the original location (file system or assembly resources) and gives precedence to views loaded from RavenDB.
The code featured here makes 2 assumptions:
1. A document name convention for view documents is preserved - basically some prefix (for example "MyWebsite/") used to prevent polluting the document store, followed by the full view name (location + name + extension). When loading all available views, we use a filter to make sure we load only views with a supported extension (determined by the installed view engines). A view-template document in RavenDB will then have an ID similar to "WebsiteViews/Views/Home/Read.cshtml".
2. There are fewer than 1024 views stored in RavenDB. This is probably safe to assume, unless you have some monstrous website.
There's one bit missing here - view-cache invalidation. By default NancyFx will cache all views it loaded indefinitely, as far as I can tell. That's bad for us, because when you update a template in your RavenDB store you do want to invalidate all caches, or at least one specific template you updated. This is something I'll keep for another post.
This is the custom ViewLocationProvider class:
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using NSemble.Core.Models;
using Nancy;
using Nancy.ViewEngines;

namespace NSemble.Core.Nancy
{
    public class RavenViewLocationProvider : IViewLocationProvider
    {
        private readonly IViewLocationProvider defaultViewLocationProvider;

        public RavenViewLocationProvider(IRootPathProvider rootPathProvider)
        {
            defaultViewLocationProvider = new FileSystemViewLocationProvider(rootPathProvider);
        }

        public RavenViewLocationProvider(IRootPathProvider rootPathProvider, IFileSystemReader fileSystemReader)
        {
            defaultViewLocationProvider = new FileSystemViewLocationProvider(rootPathProvider, fileSystemReader);
        }

        public IEnumerable<ViewLocationResult> GetLocatedViews(IEnumerable<string> supportedViewExtensions)
        {
            var sb = new StringBuilder();

            // Make sure to only load saved views with supported extensions
            foreach (var s in supportedViewExtensions)
            {
                if (sb.Length > 0)
                    sb.Append("|");
                sb.Append("*.");
                sb.Append(s);
            }

            ViewTemplate[] views = null;
            using (var session = NSembleModule.DocumentStore.OpenSession())
            {
                // It's probably safe to assume we will have no more than 1024 views, so no reason to bother with paging
                views = session.Advanced.LoadStartingWith<ViewTemplate>(Constants.RavenViewDocumentPrefix, sb.ToString(), 0, 1024);
            }

            // Read the views from the default location
            IEnumerable<ViewLocationResult> defaultViews = defaultViewLocationProvider.GetLocatedViews(supportedViewExtensions);
            if (views.Length == 0)
                return defaultViews;

            // Views from RavenDB go in first, so a default view with the same identity is not added again
            var ret = new HashSet<ViewLocationResult>(from v in views
                                                      where supportedViewExtensions.Contains(v.Extension)
                                                      select new ViewLocationResult(
                                                          v.Location,
                                                          v.Name,
                                                          v.Extension,
                                                          () => new StringReader(v.Contents)));
            foreach (var v in defaultViews)
                ret.Add(v);

            return ret;
        }
    }
}
You will need to register it in your Nancy Bootstrapper class:
protected override NancyInternalConfiguration InternalConfiguration
{
    get
    {
        return NancyInternalConfiguration
            .WithOverrides(x => x.ViewLocationProvider = typeof (RavenViewLocationProvider))
            .WithIgnoredAssembly(asm => asm.FullName.StartsWith("RavenDB", StringComparison.InvariantCulture)); // or override ConfigureApplicationContainer to set AutoRegister to false
    }
}
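The precedence rule the provider relies on can be shown in isolation, without Nancy or RavenDB. The following is only a minimal sketch under stated assumptions: `ViewPrecedence`, `View`, `Merge` and `ToDocumentId` are hypothetical names made up for illustration, a view is identified by location + name + extension only, and the "WebsiteViews/" prefix is just the example prefix from the naming convention above.

```csharp
using System;
using System.Collections.Generic;

public static class ViewPrecedence
{
    // Hypothetical stand-in for a located view. Source only records where the
    // copy came from, so we can see which one won.
    public sealed class View
    {
        public string Location { get; set; }
        public string Name { get; set; }
        public string Extension { get; set; }
        public string Source { get; set; }

        // Full view name: location + name + extension
        public string Key
        {
            get { return Location + "/" + Name + "." + Extension; }
        }
    }

    // Document ID following the convention: prefix + full view name
    public static string ToDocumentId(string prefix, View view)
    {
        return prefix + view.Key;
    }

    // Overrides go in first; a default view is added only when no override
    // with the same key has been seen.
    public static IList<View> Merge(IEnumerable<View> overrides, IEnumerable<View> defaults)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<View>();
        foreach (var source in new[] { overrides, defaults })
        {
            foreach (var view in source)
            {
                if (seen.Add(view.Key))
                    result.Add(view);
            }
        }
        return result;
    }
}
```

Merging one stored override for Views/Home/Read.cshtml with the file-system copies of Read.cshtml and Index.cshtml yields two views: the stored Read and the on-disk Index.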
Geo-spatial search with RavenDB 2.0
I recently wrote an article on geo-spatial search with RavenDB for "The Developer", a Norwegian magazine for developers. Seeing how questions on this topic keep coming in, I thought it would be a good idea to post it here as a reference for anyone interested.
So, there you go. Happy searching.
Started work on a “RavenDB in Action” book
I recently signed a contract with Manning for writing a "RavenDB in Action" book. Writing this book comes naturally after delivering many RavenDB workshops, talks, training sessions and consultancy gigs. The structure of the book and the actual content in it are based on actual experience of explaining RavenDB to people of different levels. Also many examples I'll use are going to be taken from real-world scenarios.
This book is aimed at .NET developers with actual programming experience, and I'm working hard to make sure that no RDBMS background is required, while also making sure it does not confuse SQL savants. The book is designed to help RavenDB newbies go from zero to hero. Experienced Ravenees would definitely benefit from reading it as well, as it is going to cover many advanced topics in great detail.
We are already in the middle of writing the first 4 chapters, and hopefully they'll be available on MEAP in two or three months. I'll make sure to announce once it is released...
RavenDB Consultancy & Training
I've been working as a core developer for RavenDB for quite a while, writing core features, providing support to users and customers, and co-authoring and delivering the official 2-day RavenDB Workshop. Starting October (today...) I'm offering on-site and remote RavenDB consultancy services, as an independent RavenDB consultant. I'm also available for on-site RavenDB training worldwide (1, 2 or 3 day courses).
I can be contacted either by email (itamar at this domain) or Skype (itamarsyn).
Geo-spatial searches with RavenDB
RavenDB has had geo-spatial search capabilities for quite a while, but ever since they were introduced they have been limited to finding documents with a latitude and longitude within a radius from a given point. In the past few weeks I was working on revamping the Lucene.Net spatial module, and earlier this week the work on that was completed. Next in line was getting those changes into RavenDB. I just finished doing that, and this post is going to show what it can do, and how.
First, a few words on geo-spatial indexes. To be able to represent a shape in an index, and then search for it, shapes are converted to an index-friendly representation. There are quite a few ways to do this; the most commonly known approaches are prefix trees and bounding boxes. The QuadPrefixTree approach, for example, represents the earth with 4 grid squares at its first level of precision. The squares are labeled A, B, C and D. The next level of precision introduces another letter to the representation, so we get 16 grid squares - AA, AB, AC, AD, BA, ... and so on. By having these multiple layers of precision, we can create the most efficient representation of a shape, one which balances the number of terms against precision. Another implementation called GeohashPrefixTree uses geohashes, which have more grid squares per layer.
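To make the quad-tree labeling concrete, here is a minimal sketch of how a single point could be mapped to its grid square at a given precision. It is an illustration only, not the Lucene.Net implementation: `QuadCellSketch` and `Encode` are hypothetical names, and the letter-to-quadrant assignment (A = north-west, B = north-east, C = south-west, D = south-east) is an assumption.

```csharp
using System;
using System.Text;

public static class QuadCellSketch
{
    // Returns the label of the grid square containing (lat, lng) after
    // `levels` subdivisions of the whole earth.
    // Assumed lettering: A = north-west, B = north-east, C = south-west, D = south-east.
    public static string Encode(double lat, double lng, int levels)
    {
        double minLat = -90, maxLat = 90, minLng = -180, maxLng = 180;
        var label = new StringBuilder(levels);
        for (int i = 0; i < levels; i++)
        {
            double midLat = (minLat + maxLat) / 2;
            double midLng = (minLng + maxLng) / 2;
            bool north = lat >= midLat;
            bool east = lng >= midLng;
            label.Append(north ? (east ? 'B' : 'A') : (east ? 'D' : 'C'));

            // Shrink the bounding box to the chosen quadrant
            if (north) minLat = midLat; else maxLat = midLat;
            if (east) minLng = midLng; else maxLng = midLng;
        }
        return label.ToString();
    }
}
```

Every extra level appends one letter, so the label at a coarser level is always a prefix of the label at a finer one. That prefix property is what lets a large shape be indexed with a few short terms and a small shape with longer, more precise ones.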
Before diving any deeper, here's how you would perform a simple point and radius spatial search. This is taken directly from the old API (which we revised a bit), and since it's easier to use for the most common usage of geo-spatial searches, we left it mostly intact:
// The spatial index
public class LegacySpatialIndex : AbstractIndexCreationTask<Event>
{
    public LegacySpatialIndex()
    {
        Map = docs => from doc in docs
                      select new
                      {
                          doc.Title,
                          _ = SpatialGenerate(doc.lat, doc.lng)
                      };
    }
}

// The querying method
public IEnumerable<Event> GetEventsLegacy()
{
    IEnumerable<Event> events;
    using (var session = store.OpenSession())
    {
        events = session.Query<Event>()
            .Customize(x => x.WithinRadiusOf(10, 32.456236, 54.234053))
            .ToList();
    }
    return events;
}
The new spatial stuff is quite powerful, and we really wanted to keep all that power in your hands. Therefore, when defining an index you get a chance to specify which spatial strategy and what prefix tree "height" to use. You can just use the defaults if you wish to, of course.
Shapes in both documents and queries are represented using WKT - a markup language for representing shapes, so they are as human readable as they can possibly be. Using WKT also frees everyone from hard-to-use APIs and tons of classes, at least as long as the shapes you use are simple enough. If you are expecting to handle complex shapes, it is recommended that you install NetTopologySuite from NuGet to help you with creating shapes and serializing them to their WKT string representation.
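For simple shapes the WKT string can be produced by hand. A minimal sketch for points follows; `WktSketch` and `Point` are hypothetical helper names, and the coordinate order assumes the usual WKT convention of "x y", which puts longitude before latitude.

```csharp
using System.Globalization;

public static class WktSketch
{
    // Formats a point as WKT. Assumes the usual "x y" order, so longitude
    // comes first. The invariant culture keeps '.' as the decimal separator
    // regardless of the machine's regional settings.
    public static string Point(double latitude, double longitude)
    {
        return string.Format(CultureInfo.InvariantCulture, "POINT ({0} {1})", longitude, latitude);
    }
}
```

Formatting with the current culture instead would emit a decimal comma on some machines, which is no longer a legal shape string.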
Here is an example of the new capabilities. Please note, I just pushed the code for that in, so the API might change a bit by the time you get to play with it:
public class Event
{
    public string Title { get; set; }

    // WKT representation of a point on earth, ex. POINT (24.532341 54.352753)
    public string Location { get; set; }
}

public class SpatialIndex : AbstractIndexCreationTask<Event>
{
    public SpatialIndex()
    {
        Map = docs => from doc in docs
                      select new
                      {
                          doc.Title,
                          _ = SpatialGenerate(fieldName: "Location", shapeWKT: doc.Location,
                                              strategy: SpatialSearchStrategy.GeohashPrefixTree, maxTreeLevel: 12)
                      };
    }
}

public IEnumerable<Event> GetEvents()
{
    IEnumerable<Event> events;
    using (var session = store.OpenSession())
    {
        // Arguments are passed positionally: a positional argument cannot follow named ones
        events = session.Query<Event>()
            .Customize(x => x.RelatesToShape("Location",
                "Circle(32.454898 53.234012 d=6.000000)", SpatialRelation.Within))
            .ToList();
    }
    return events;
}
This is the unbound version of the API, and you can do just about anything with it. A few notes about this new API:
- The SpatialGenerate() method in the index definition is expecting a WKT formatted string. It can be any shape you want, but it has to be a legal shape string.
- Specifying a spatial strategy is done when defining the index. Changing a strategy will trigger re-indexing.
- The strategy and maxTreeLevel parameters are completely optional. Only use them if you know what you are doing; otherwise, stick to the defaults.
- You can provide ANY shape while querying, and an expected relation to it. More details on shape relations below.
- The results will be sorted by distance, unless otherwise requested.
- You can store several shapes in one document, and specify which shape it is you want to query on, using the fieldName argument in both the index definition and the query. However, at this point you can execute a query only against one spatial field at a time (but as many non-spatial fields as you want).
Obviously, one of the benefits of this new implementation is the ability to index any shape, and to issue a query with any shape against them. Circles, points, squares, polygons - RavenDB doesn't care anymore.
There are 3 types of shape relationships that are supported with this new implementation:
- Intersects - querying for a shape which intersects a shape stored in a document within RavenDB will find those shapes which intersect with the given shape. Intersection occurs when the two shapes have at least one shared grid hash. Because of current limitations of the algorithm, very large indexed shapes are not deemed to intersect with very small query shapes. However, smaller indexed shapes will intersect with larger query shapes.
- Disjoint - Finds those indexed shapes which are disjoint from the query shape. This means that the indexed shapes and the query shape must have no shared grid hashes.
- Within / Contains - Finds those indexed shapes which are fully contained within the query shape. Unlike Intersects, this means that all of the indexed shape must be present in the query shape. Any shapes which have additional area outside of the query shape are excluded.
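The three relations can be pictured as plain set operations over grid hashes. The sketch below is a deliberate simplification and not how the spatial module is implemented: it assumes both shapes have already been reduced to sets of grid hashes at the same tree level, and `GridRelationSketch` and its methods are hypothetical names.

```csharp
using System.Collections.Generic;

public static class GridRelationSketch
{
    // Intersects: the two shapes share at least one grid hash
    public static bool Intersects(ISet<string> indexed, ISet<string> query)
    {
        return indexed.Overlaps(query);
    }

    // Disjoint: the two shapes share no grid hash at all
    public static bool Disjoint(ISet<string> indexed, ISet<string> query)
    {
        return !indexed.Overlaps(query);
    }

    // Within: every grid hash of the indexed shape is also part of the query shape
    public static bool Within(ISet<string> indexed, ISet<string> query)
    {
        return indexed.Count > 0 && indexed.IsSubsetOf(query);
    }
}
```

With a query shape covering { AA, AB, AC }, an indexed shape covering { AA, AB } is both intersecting and within, one covering { AC, BA } intersects but is not within, and one covering { DD } is disjoint.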
Limitations and gotchas:
- Distances in this new implementation are in kilometers, while the old implementation was using miles. Since this is what the internal implementation uses, and it is hardly exposed to the end user, we kept using the metric system. It is quite easy to convert this back to miles, and if there is demand we might introduce a configuration option on the server side to do that.
- Handling of polygons which cross the dateline isn't supported at this stage.
- Multi-polygon support is lacking.
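Until such a configuration option exists, the conversion can be done on the client before building the query. A minimal sketch, with `DistanceUnits` as a hypothetical helper name:

```csharp
public static class DistanceUnits
{
    // One international mile is exactly 1.609344 kilometers
    public const double KilometersPerMile = 1.609344;

    public static double MilesToKilometers(double miles)
    {
        return miles * KilometersPerMile;
    }

    public static double KilometersToMiles(double kilometers)
    {
        return kilometers / KilometersPerMile;
    }
}
```

Code that used to search within 10 miles would now pass DistanceUnits.MilesToKilometers(10) as the distance.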
This new feature is really neat, and opens up great new opportunities with its simplicity and ease of use. It is available to us thanks to the spatial4j project, and powered by Lucene.Net, Spatial4n and NetTopologySuite.
Leaving Hibernating Rhinos
After working for a while as a core developer for RavenDB, it is time for me to move on. Starting September, I will no longer be with Hibernating Rhinos, working on, supporting and training on RavenDB as my day job.
I love RavenDB. It is a great product, and I'm sure it will go very far. The design decisions behind it make it a real work of art. The way it breaks the old bad habits we developers have, and the design it forces us to use, make you build better apps overall. Finally, a database that actually helps you do your job without any real compromise.
However, from now on, RavenDB for me will be an open-source project which I'm involved in. As time permits, I will continue to hang out in the mailing list and provide support, perhaps even adding features or fixing bugs from time to time.
I will also continue providing training and consulting, on-site and remote. My next scheduled trip is to London during the first 2 weeks of September. I still have a few open slots for UK-based companies, so feel free to shoot me an e-mail. I will also deliver this free talk on RavenDB at SkillsMatter on September 12th while there, and will be happy to take any RavenDB related questions after it or over a beer (good beer and conversation always buys me off..).
My next gig is with Buzzilla, an Israeli-based company which builds software that can track, monitor and analyze on-line conversation in social networks and the general Internet. A lot of cutting edge stuff is going on there, involving Machine Learning, NLP and search engines, so that's pretty exciting. I'm going to be leading a development team building a new search engine platform, tackling problems like distributed search and multi-lingual content, and working with BigData, which is always fun. Great times ahead!
Hacking with RavenDB’s multi-maps
A couple of months ago I blogged about Orev - OpenRelevance viewer. The purpose of Orev, in short, is to create materials and a sandbox for measuring the relevance of different full-text search methods.
In Orev we have Corpora, Topics and Judgments. A user is shown a Topic (= a few sentences describing something), and a Corpus Document, and he has to make a Judgment - whether the document is relevant to this Topic or not. By having a lot of judgments on a lot of corpora, using a lot of topics, we can perform automatic searches with different methods, and measure their relevance.
Orev was built using RavenDB as its backing store, and in this post I'm going to show a nice approach we used to facilitate the judging process.
The Model
To start with, the model is a very simple one - we have Topic, User and Corpus, and of course we have a Judgment.
A Corpus has many CorpusDocuments, which are saved separately and not within the Corpus document itself. This is done for many reasons: they are different transactional units (if I update a typo in a document, the entire Corpus doesn't really change), and we want to retrieve one single document at a time when judging. Also, containing all documents within a parent Corpus document will bloat that document tremendously. So, for all intents and purposes, each CorpusDocument has to be stored as its own document.
And this is how they look:
public class Corpus
{
    public string Id { get; set; }

    [Required]
    public string Name { get; set; }

    [Required]
    public string Description { get; set; }

    [Required]
    [StringLength(5, MinimumLength = 5)]
    public string Language { get; set; } // a language identifier string, en-US for example
}

public class CorpusDocument
{
    public string Id { get; set; }
    public string CorpusId { get; set; }
    public string Title { get; set; }
    public string Content { get; set; }
    public string InternalUniqueName { get; set; } // to allow us to track original name in the imported corpus
}

public class Topic
{
    public string Id { get; set; }

    [Required]
    public string Title { get; set; }

    [Required]
    [DataType(DataType.MultilineText)]
    public string Description { get; set; }

    [Required]
    [DataType(DataType.MultilineText)]
    public string Narrator { get; set; }

    [Required]
    [StringLength(5, MinimumLength = 5)]
    public string Language { get; set; } // a language identifier string, en-US for example

    /// <summary>
    /// Id of user submitting this topic
    /// </summary>
    public string UserId { get; set; }
}

public class Judgment
{
    public enum Verdict
    {
        Relevant,
        NotRelevant,
        Skip,
    };

    [Required]
    public string CorpusId { get; set; }

    [Required]
    public string DocumentId { get; set; }

    [Required]
    public string TopicId { get; set; }

    [Required]
    public string UserId { get; set; }

    [Required]
    public Verdict UserJudgement { get; set; }
}
The Problem
We deployed the application, and imported a lot of Topics, Corpora and CorpusDocuments. Now we want to start generating Judgments. So we let our user select the Corpus he wants to work on, and a Topic to judge CorpusDocuments against. But once we start the judgment process, how can we pull the next CorpusDocument? Remember, we have to find one in the selected Corpus that hasn't been judged yet for the selected Topic.
Before jumping ahead to the solution, try to think how you would solve this yourself. Hint: it involves multi-maps.
The Solution
At first glance it seems the query we are going to issue is going to ask RavenDB questions about Judgments. More specifically, it is going to ask it for all Judgments that were not yet made for a specific CorpusDocument and Topic. But how can we query on documents that do not exist?
And then we realize that we are actually querying for a CorpusDocument: when judging, I don't care about other judgments, all I want is to get the next CorpusDocument to show to the user. Another realization is that if I look at all the Judgments made on a specific CorpusDocument, I can get a list of Topics it has been judged against, and perhaps work my way from that. If only I could consolidate both... hmm...
So this is where RavenDB's multi-maps come in. I select all Judgments with their Topic ID within an array, and all CorpusDocuments each with an array holding a single empty string. This will result in one big set of rows, with each row containing the CorpusDocument ID (which is a document ID + the corpus ID) and one Topic ID there exists a Judgment for. The reason I'm selecting Topics as an array at this stage is to comply with the format we will produce results in during the Reduce step; RavenDB requires all Map and Reduce functions to have the same type of output.
It is important to note ALL corpus documents will be listed, but there may be corpus documents with no topics at all - they will be represented by one row with the CorpusDocument ID, and with an empty string as the Topic ID.
The next thing we want to do is to perform a Reduce step on that set of rows. Notice that if we group all the rows based on the CorpusDocument ID (which includes the Corpus ID), we can have a smaller set of rows, where a CorpusDocument is represented only once, and along with it all the Topic IDs there are Judgments for. So, if we previously had a lot of rows, each row with one CorpusDocument identifier and one Topic identifier, we now consolidated all the data we have for each CorpusDocument into one row per CorpusDocument. And this is exactly what we want to have.
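Before looking at the index itself, the same two maps and the grouping step can be tried on in-memory lists with plain LINQ. This is only a sketch of the idea, not RavenDB code: `MultiMapSketch`, `Row`, `FromDocument`, `FromJudgment` and `Reduce` are hypothetical names, and the document and topic IDs used with it are made up.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class MultiMapSketch
{
    // The common output shape shared by both maps and the reduce step
    public class Row
    {
        public string DocumentId { get; set; }
        public string CorpusId { get; set; }
        public string[] Topics { get; set; }
    }

    // Map over CorpusDocuments: one row per document, with a single empty Topic ID
    public static Row FromDocument(string documentId, string corpusId)
    {
        return new Row { DocumentId = documentId, CorpusId = corpusId, Topics = new[] { string.Empty } };
    }

    // Map over Judgments: one row per judgment, carrying the judged Topic ID
    public static Row FromJudgment(string documentId, string corpusId, string topicId)
    {
        return new Row { DocumentId = documentId, CorpusId = corpusId, Topics = new[] { topicId } };
    }

    // Reduce: collapse all rows of the same CorpusDocument into one row
    // holding every distinct Topic ID seen for it
    public static List<Row> Reduce(IEnumerable<Row> rows)
    {
        return (from row in rows
                group row by new { row.DocumentId, row.CorpusId } into g
                select new Row
                {
                    DocumentId = g.Key.DocumentId,
                    CorpusId = g.Key.CorpusId,
                    Topics = g.SelectMany(x => x.Topics).Distinct().ToArray()
                }).ToList();
    }
}
```

Feeding it two documents and two judgments made on the first one produces two rows: the judged document with the empty string plus both Topic IDs, and the unjudged document with the empty string only.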
Hence, we write this index:
public class CorpusDocuments_ByNextUnrated : AbstractMultiMapIndexCreationTask<CorpusDocuments_ByNextUnrated.ReduceResult>
{
    public class ReduceResult
    {
        public string DocumentId { get; set; }
        public string CorpusId { get; set; }
        public string[] Topics { get; set; }
    }

    public CorpusDocuments_ByNextUnrated()
    {
        AddMap<CorpusDocument>(docs => from corpusDoc in docs
                                       select new { DocumentId = corpusDoc.Id, CorpusId = corpusDoc.CorpusId, Topics = new[] {string.Empty} }
            );

        AddMap<Judgment>(judgments => from j in judgments
                                      select new { DocumentId = j.DocumentId, j.CorpusId, Topics = new[] { j.TopicId } });

        Reduce = results => from result in results
                            group result by new { result.DocumentId, result.CorpusId }
                            into g
                            select new
                            {
                                DocumentId = g.Key.DocumentId,
                                CorpusId = g.Key.CorpusId,
                                Topics = g.SelectMany(x => x.Topics).Distinct().ToArray(),
                            };

        TransformResults = (db, results) => from result in results
                                            let doc = db.Load<CorpusDocument>(result.DocumentId)
                                            select doc;
    }
}
Now we have an index which contains all the info we need: the CorpusDocument ID, the ID of the Corpus it belongs to, and the list of topics with judgments for each CorpusDocument, where all CorpusDocuments exist, even if they were never judged for any Topic. Performing the actual query is now just a matter of performing a match-all-docs-except query:
var query = RavenSession.Advanced.LuceneQuery<CorpusDocument, CorpusDocuments_ByNextUnrated>()
    .Where("Topics:*") // match all docs
    .AndAlso()
    .WhereEquals("CorpusId", corpusId)
    .AndAlso()
    .Not
    .WhereEquals("Topics", topicId) // remove corpus docs with a particular TopicId attached to them
    .RandomOrdering()
    .FirstOrDefault();
This will issue a Lucene query like this: Topics:* AND CorpusId:corpus/1 AND -Topics:topics/1
This query will first match all index documents from the given corpus, and then will remove all CorpusDocuments which have a given TopicId attached to them. The way we built the index, if a CorpusDocument has a certain TopicId attached to it in the index, that means a Judgment has previously been made on it; and if a CorpusDocument has already been judged for our Topic, we are not interested in it anymore.
And just to spice things up a bit, I threw in RandomOrdering().
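The filtering logic of that query can be restated over in-memory index entries, which makes it easy to reason about. Again this is a hedged sketch rather than anything RavenDB runs: `NextUnratedSketch`, `IndexEntry` and `Candidates` are hypothetical names, and random ordering is left out so the result is deterministic.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NextUnratedSketch
{
    // One reduced index entry per CorpusDocument
    public class IndexEntry
    {
        public string DocumentId { get; set; }
        public string CorpusId { get; set; }
        public string[] Topics { get; set; }
    }

    // Equivalent of: Topics:* AND CorpusId:<corpusId> AND -Topics:<topicId>
    public static List<string> Candidates(IEnumerable<IndexEntry> entries, string corpusId, string topicId)
    {
        return entries
            .Where(e => e.Topics.Length > 0)           // Topics:*  (every entry has at least the empty topic)
            .Where(e => e.CorpusId == corpusId)        // CorpusId:<corpusId>
            .Where(e => !e.Topics.Contains(topicId))   // -Topics:<topicId>
            .Select(e => e.DocumentId)
            .ToList();
    }
}
```

Any document returned is in the selected corpus and has no judgment for the selected topic, so picking one of them at random gives the next document to show to the user.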
WMS: Rethinking our need for CMS
Traditionally, Content Management Systems are about Content. Whenever Data that is not simply a content page was going to be persisted in a CMS, you'd somehow fit it in a Content entity, keeping the thought process always at "how would it look to the end user", never treating it as actual data.
Sure, there are page-centric CMSes like DNN, and CMSes which try to be data-centric. But when talking CMSes, you can either be too focused on solving your problem, in which case, unless it is actual Content Pages that you are working with, you will probably move away from CMSes anyway, or abstract too much in order to be a one-size-fits-all, which will always end badly. There doesn't seem to be any middle ground.
This seems to be a problem with the roots of CMSes, the way they evolved, and their strong ties to always being backed by a relational database. CMSes were always about "Content", not about "Data", and the relational model doesn't make your life any easier when you have actual data you want to work with (not just "display").
WMS: Redefining CMS
In today's reality, we might choose to continue bending terms and force all those types of systems under the definition of a CMS, sure, but this will most probably end up as a multi-purpose data-to-ContentPage converter built on your MVC framework of choice, achieving nothing, really. Been there, done that.
Instead, I like to think of a WMS - Website Management System. I want this thing to help me deploy any website quickly and efficiently, allow me to manage my content AND DATA by providing admin screens, play nicely with my IDE, support multi-site nicely, and as a general rule not stand in my way. This is mainly because I don't like building websites, I like building software that might be served to the world as websites. I don't like being a server administrator, either.
A perfect WMS would have a gallery of modules I could easily integrate, so I could deploy a website, any website, with no hassle, getting full-blown admin screens for each module that are actually fit for managing the data used by the modules. No stretching or rounding corners.
And whenever I'd want to add a new custom functionality to my website, I could just fire up my IDE, implement some interfaces, test, play with the new website locally, deploy, set 2-3 settings, and be done with it.
An important thing to note about a WMS is that it does NOT manage data for you, and also the concept of Content is gone forever. A ContentPages module might exist, but it is going to treat all content-pages as data, the module's data. A WMS is not about persistence, and the only data access it does is for persisting configurations. It might manage data-access for the modules, but only to an extent of being a middleman for serving data requests, requests being created by the module itself. This is a design decision we have to make to make sure we do not pose limits to what we can do with modules; data has to be persisted independently of the WMS, but optionally orchestrated by it.
Spec'ing WMS
So how would one go about implementing such a thing?
First, by dropping SQL and using a schema-less database, with a document database probably being the best fit here. Getting rid of the schema is the first real step towards creating real-world dynamic applications with good performance.
Then, by moving away from thinking in "Pages" and starting to think in terms of "Routes". Back in the days when we were displaying only content pages, it made sense to have a URL to a Page. With a schema-less model, we have to make our website URLs "schema-less" as well - and this is where routing comes in. Each module will define its own set of URL templates, and while configuring your website you will tell the WMS which Area, or URL prefix, this module is going to use. Simple modules like a ContentPages one will probably have one rule, and more complex ones like a products catalog or an eCommerce module will have all the flexibility they need.
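One way to picture the Area idea is as a function that takes the URL templates a module declares and mounts them under the prefix chosen at configuration time. This is purely a sketch of the concept: `RouteMountSketch` and `Mount` are hypothetical names, and the template syntax with curly-brace placeholders is an assumption, not part of any existing WMS.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RouteMountSketch
{
    // Combines the Area (URL prefix) chosen in the site configuration with
    // the URL templates declared by a module.
    public static IList<string> Mount(string areaPrefix, IEnumerable<string> moduleTemplates)
    {
        var prefix = areaPrefix.Trim('/');
        var routes = new List<string>();
        foreach (var template in moduleTemplates)
        {
            var tail = template.Trim('/');
            // Skip empty segments so an empty prefix or a root template
            // does not produce doubled slashes
            var joined = string.Join("/", new[] { prefix, tail }.Where(part => part.Length > 0));
            routes.Add("/" + joined);
        }
        return routes;
    }
}
```

A catalog module declaring "/", "/{category}" and "/{category}/{productId}" and mounted under "catalog" would answer on /catalog, /catalog/{category} and /catalog/{category}/{productId}, while a ContentPages module mounted at the root could get by with a single "/{slug}" rule.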
And this brings me to the MVC pattern, which we are probably going to want to use. To use the MVC pattern correctly, each module will contain the Controllers, Models and Views it is going to use. A Controller must provide all the content for the View to consume, so all data-access operations are made in the Controllers alone.
This is some high-level description of what I consider to be a fairly robust and flexible WMS, what in my opinion should replace what we know today as a CMS. I'll be writing a follow-up post with some discussion on other aspects which I consider important, namely Menus, Navigation and Hierarchies. But these are at some level just implementation details, which once you get the framework right, will need to fit in nicely.
The future of geo-spatial searches with Lucene
Or: Introducing Spatial4n
The Lucene spatial contrib module has been a nice addition to Lucene, but for a while now too many bug reports have been piling up, and it got to a point where it was clear something was broken somewhere deep inside. Luckily, a bunch of good people started writing their own general purpose geo-spatial library in Java, and provided a Lucene module to interact with it to provide spatial search functionality. This project is called Spatial4j (formerly Lucene Spatial Playground), and it works great, solving all known issues with the previous implementation.
Even better, it was built from the ground up to support complex searches, like polygons and other custom shapes, as well as different search strategies. This is not just about a circle and a radius anymore. The guys that created it really dig geo-spatial searches, so it is probably going to get a lot better over time.
This library, as well as its accompanying Lucene module, is now part of Lucene, and should be available to all when Lucene 4.0 is released.
Since RavenDB uses the old spatial module, we were getting quite a few bug reports ourselves, without really being able to do anything about it. So when we heard about this project, it was clear that we should be using it. And since it is written in Java, well - luckily this isn't the first piece of Java code I've been porting...
The Spatial4n library - the .NET version of Spatial4j - is available here: https://github.com/synhershko/Spatial4n
The Lucene part of things, sync'd with Lucene.NET's trunk, can be found here: https://github.com/synhershko/lucene.net/tree/spatial2trunk . It will be there until those are merged upstream. There is also a branch with the new spatial module that is compatible with the 2.9.4 API - https://github.com/synhershko/lucene.net/tree/spatial.
We had to do some custom coding to get it to work with all the functionality we wanted, but it was all doable and so far this library looks very promising. All it needs now is a bit more attention.