Code972 Coding from the back of a camel

30 Jan 2012

RavenDB London tour

This February I'll be visiting London, consulting on RavenDB and giving our course at Skills Matter. There is also a free session, in which I'll discuss the RavenDB indexing system and how to make the most out of it.

More info on our RavenDB London course (Feb 28-29): http://skillsmatter.com/course/open-source-dot-net/ayende-rahiens-ravendb-workshop

The "In The Brain" session (Feb 28th, 18:30): http://skillsmatter.com/event/open-source-dot-net/ravendb

I might have a free evening there, and I'll be happy to discuss RavenDB or any other dev topic over a beer. Just saying.

6 Dec 2011

RavenDB Caching done right (EventsZilla part II)

In the previous post we created the basics for an events publishing application, and discussed the modeling aspect of things.

I put some more work into the app, and now it actually works and looks pretty nice. Queries and loads are in place for the front-end, so it is time to visit one key feature of RavenDB - Caching.

Basic caching

The RavenDB Client API provides automatic, out-of-the-box caching for all read operations. Every data request sent to the server is remembered by the document store object, so subsequent read operations that are detected as identical can be served from the cache.

However, it is important to beware of common pitfalls that may keep you from taking advantage of this handy feature. While there's no real way to mess up simple Load operations, it is very easy to do so when querying.

For example, the most common query in an application like EventsZilla is to get events starting before or after a certain point in time, usually DateTimeOffset.Now. However, a query like this is guaranteed to never use the cache, since its parameter is different virtually every time it is called.

In EventsZilla we can fix this relatively easily by lowering the DateTimeOffset resolution when querying. Another approach would be to round the value up (or down). The actual resolution or rounding approach we use determines how much caching this query can take advantage of.

Relevant code can be found here.
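
The effect of lowering the resolution can be sketched like this. It is a minimal illustration and not the EventsZilla code: the QueryTime class and its RoundDown method are hypothetical names, and the five-minute resolution in the usage note is an arbitrary choice.

```csharp
using System;

// Hypothetical helper, not part of RavenDB or EventsZilla.
public static class QueryTime
{
    // Rounds a timestamp down to a multiple of `resolution`, so every call
    // made within the same window produces the same query parameter.
    public static DateTimeOffset RoundDown(DateTimeOffset value, TimeSpan resolution)
    {
        long ticks = value.Ticks - (value.Ticks % resolution.Ticks);
        return new DateTimeOffset(ticks, value.Offset);
    }
}
```

Passing QueryTime.RoundDown(DateTimeOffset.Now, TimeSpan.FromMinutes(5)) to the query instead of DateTimeOffset.Now would make all queries issued within the same five-minute window identical, at the price of results that can lag by up to five minutes.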

Aggressive caching

Basic caching is very effective, requires no action from the user's end to work, and is a great feature for automatically improving your application's performance. However, a server request is still issued with every read operation to make sure the cache never goes stale. The actual benefit of basic caching is getting back a quick, thin 304 response (HTTP for "I haven't changed") instead of a complete 200 response with all the requested data.

At times we load an object - or perform a query - whose changes we really don't care about for a certain period of time, or that we just don't expect to change. If we choose to, we can tell RavenDB not to query the database at all if it has a cached response that is not older than a given age.

This feature is called Aggressive Caching, being aggressive in the sense of not peeking outside the cache at all. Unlike basic caching, it is an opt-in feature.

In EventsZilla, this is exactly the case with a website-wide config object. We don't expect it to change a lot, and when it changes, we can bear a certain amount of time until the changes are noticeable in our website.

All we need to do to make it happen is load that object within a context of an AggressiveCache, and the RavenDB Client API will take care of the rest for us.

Using Aggressive Caching is as simple as this:

using (RavenSession.Advanced.DocumentStore.AggressivelyCacheFor(TimeSpan.FromMinutes(30)))
{
	var siteConfig = RavenSession.Load<SiteConfig>(SiteConfig.ConfigName);
}
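
To make the difference between the two modes concrete, here is a toy in-memory model. It is only a sketch: ToyCache and all of its members are hypothetical, an integer version stands in for the ETag, and none of this reflects how the RavenDB client is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// Toy model of basic vs. aggressive caching; every name here is hypothetical.
public sealed class ToyCache
{
    private sealed class Entry
    {
        public int Version;              // stands in for the ETag
        public string Body;
        public DateTimeOffset CachedAt;  // last time the server confirmed this entry
    }

    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();

    public int ServerCalls { get; private set; }    // requests sent at all
    public int FullResponses { get; private set; }  // "200" responses carrying a body

    // serverVersion/serverBody describe what the server currently holds.
    // maxAge == null means basic caching: always revalidate with the server.
    public string Get(string key, int serverVersion, string serverBody,
                      DateTimeOffset now, TimeSpan? maxAge = null)
    {
        Entry entry;
        bool cached = entries.TryGetValue(key, out entry);

        // Aggressive mode: a fresh enough entry is returned without any request.
        if (cached && maxAge.HasValue && now - entry.CachedAt <= maxAge.Value)
            return entry.Body;

        ServerCalls++;
        if (cached && entry.Version == serverVersion)
        {
            entry.CachedAt = now;  // "304": nothing changed, reuse the cached body
            return entry.Body;
        }

        FullResponses++;           // "200": the full payload is transferred
        entries[key] = new Entry { Version = serverVersion, Body = serverBody, CachedAt = now };
        return serverBody;
    }
}
```

In this model a basic read always costs one server call but usually no payload, while an aggressive read inside the allowed age costs nothing and may return data the server has already replaced.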

More on caching

Is in the second part of the excellent RavenOverflow video, available here.

29 Nov 2011

EventsZilla: RavenDB modeling walkthrough

I needed a simple event publishing application. I also felt like doing another RavenDB sample app and a RavenDB post on it. This is how EventsZilla came to life.

EventsZilla (full sources here: https://github.com/synhershko/eventszilla) is meant to be a simple web application for announcing events along with their schedules, and for browsing past and future events. People should be able to register for an event without registering with the website, and also view slides and other content when they become available post-event.

This post is being written during development, describing each stage and the considerations leading to the next. As such, the code I link to does not necessarily work, although it should. I will probably have some fixes and amendments made to the code after publishing this post.

Initial modeling

When we speak of an events publishing application, what are we looking at? The most basic items are an Event with means of registration, and a list of sessions for each Event. Each event should have a registration window, and a venue in which it takes place, and obviously title and description.

For each session in an event we want to have a Presenter (possibly more than one), a title, a brief (aka abstract), and times in which each session starts and ends. We should note the start and end time of the event are going to be derived directly from the first and last sessions of the event. For now we call a session a "Schedule slot".

Unlike a relational model, with RavenDB we can sketch the entire thing as one class and just use it. There is one exception though - at this stage we already know venues and presenters might appear several times in different events (maybe even the same presenter in multiple sessions of the same event), so we don't want to store them directly under the event, but rather link to them by storing their IDs only. They can be efficiently retrieved using the Includes feature.

We end up with this Event class:

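	using System;
	using System.Collections.Generic;
	using System.Linq; // OrderBy / FirstOrDefault used by StartsAt and EndsAt

	public class Event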
	{
		public Event()
		{
			Schedule = new List<ScheduleSlot>();
		}

		public int Id { get; set; }
		public string Title { get; set; }
		public string Slug { get; set; }
		public string Description { get; set; } // markdown content

		public string VenueId { get; set; }

		public DateTimeOffset CreatedAt { get; set; }

		public DateTimeOffset RegistrationOpens { get; set; }
		public DateTimeOffset RegistrationCloses { get; set; }
		public int AvailableSeats { get; set; }

		public class ScheduleSlot
		{
			public List<string> PresenterIds { get; set; } // list of person IDs
			public string Title { get; set; }
			public string Brief { get; set; } // markdown
			public DateTimeOffset StartingAt { get; set; }
			public DateTimeOffset EndingAt { get; set; }
		}
		public List<ScheduleSlot> Schedule { get; set; }

		public DateTimeOffset StartsAt
		{
			get
			{
				var firstSession = Schedule.OrderBy(x => x.StartingAt).FirstOrDefault();
				return firstSession == null ? DateTimeOffset.MinValue : firstSession.StartingAt;
			}
		}

		public DateTimeOffset EndsAt
		{
			get
			{
				var lastSession = Schedule.OrderByDescending(x => x.EndingAt).FirstOrDefault();
				return lastSession == null ? DateTimeOffset.MaxValue : lastSession.EndingAt;
			}
		}
	}

Since an event schedule has no meaning outside the scope of an event, it is best persisted there as well. It also means the whole schedule will be loaded with the event with each Load or Query operation this event will be part of. At this stage we are fine with that.

The StartsAt and EndsAt properties of the Event are persisted this way to take some pressure off the indexes we are going to create, so business logic will reside in the actual domain types instead of in the indexes as much as possible.

The Venue and Presenter classes are quite trivial, so they won't be shown here.

The actual code for this phase is in this github commit.

Event registration

Registering for an event is quite a common operation in our system, and on a crowded website multiple registrations for the same event can be made at the same time.

Like an event schedule, an event registration has no meaning at all outside the scope of the event itself, at least as long as we don't try to keep track of attendees (which we don't). Unlike the schedule, the attendees list is going to change quite a lot, and often at the same time. For this reason, keeping this list within the Event object itself won't make sense, as it would require us to start thinking about conflict resolution when two or more people try to register for the same event at the same time.

Another reason not to save registrations within the Event itself is we don't really care about it when we load an event, and that list can grow quite big for certain events. We want to make sure the Event object only holds data we are going to access frequently at that context; the registrants list is not that type of data.

To keep the registrants list separate, while still making sure we don't need to worry about possible conflicts, I created a simple EventRegistration class which holds all that data. We persist it exactly that way, and whenever we need to know the number of people who registered for the event, we query the DB for a count of registrations for that event. That query uses a simple static index we defined upfront.

	public class EventRegistration
	{
		public string EventId { get; set; }
		public string RegistrantEmail { get; set; }
		public string RegistrantName { get; set; }
		public DateTimeOffset RegisteredAt { get; set; }
	}

Actual code for this phase is in this commit.
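
What the count query computes can be pictured with plain in-memory LINQ. This is only a stand-in for illustration: the real application asks the RavenDB index rather than a list, and RegistrationRow and RegistrationCounts are hypothetical names for a trimmed copy of the class above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down copy of EventRegistration for this sketch.
public class RegistrationRow
{
    public string EventId { get; set; }
    public string RegistrantEmail { get; set; }
}

public static class RegistrationCounts
{
    // Each registration is its own small document, so counting is just a
    // filter on EventId; no shared Event document is touched on registration.
    public static int CountFor(IEnumerable<RegistrationRow> registrations, string eventId)
    {
        return registrations.Count(r => r.EventId == eventId);
    }
}
```

Because every registration is written as a separate document, two people registering at the same moment never write to the same document, which is what removes the need for conflict resolution.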

Available seats

Still in the context of registrations, it is important to note that by design we are not necessarily blocking registration for an event the second it is full. The reason for this is that we are getting the number of people that registered for the event through an index query, and while RavenDB's indexing process is quite fast, it is possible that on a busy website this query will return a number that is not entirely up to date.

In EventsZilla, this is a design decision we made: not to think like a computer. Your PC knows to respect a hard limit, but in life we hardly ever do. So if your event has 100 seats, wouldn't you be able to squeeze in 10-15 more? And are you really that sure all of the original 100 registrants will indeed show up?

For that reason we don't care if the count we got back from our query is not the actual count at the point of time where we issued the query. This is quite a common practice with Eventual Consistency - we don't try hard to respect hard limits that don't really exist anyhow, and wonderful things happen.

RavenDB can tell us the count we got is stale. Waiting for non-stale results is possible, but in production it is really not recommended, as it is going to significantly slow down your system. In some cases it can also result in an infinite wait. So don't do that.

If we really needed to keep to a hard limit, we could add a RegistrantsCount property to our Event class, and increment it with every registration. It requires a bit more work to make sure concurrent writes are detected, and one is delayed, so no incorrect counts happen, but it will ensure we can know the exact count at all times, since we could then retrieve that event using a Load operation, which is ACID.
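
The hard-limit variant boils down to optimistic concurrency: read the counter together with a version, and let a write succeed only if the version is unchanged. The sketch below models this in memory. EventCounter, TrySave and TryRegister are hypothetical names, and the integer version stands in for whatever concurrency check the document store provides.

```csharp
using System;

// In-memory stand-in for an Event document with a RegistrantsCount property.
public sealed class EventCounter
{
    private readonly object gate = new object();
    private int version;
    private int registrantsCount;

    public int AvailableSeats { get; private set; }

    public EventCounter(int availableSeats) { AvailableSeats = availableSeats; }

    // Reads the count and its version together, like loading the document.
    public void Read(out int currentVersion, out int currentCount)
    {
        lock (gate) { currentVersion = version; currentCount = registrantsCount; }
    }

    // Succeeds only if nobody has written since `expectedVersion` was read.
    public bool TrySave(int expectedVersion, int newCount)
    {
        lock (gate)
        {
            if (version != expectedVersion) return false; // concurrent write detected
            registrantsCount = newCount;
            version++;
            return true;
        }
    }
}

public static class HardLimitRegistration
{
    public static bool TryRegister(EventCounter ev)
    {
        while (true)
        {
            int version, count;
            ev.Read(out version, out count);
            if (count >= ev.AvailableSeats) return false;       // hard limit reached
            if (ev.TrySave(version, count + 1)) return true;
            // Someone else registered in between: reload and try again.
        }
    }
}
```

The losing writer is simply delayed by one more read, which is the extra work mentioned above, and the count never exceeds AvailableSeats.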

Next in line

As time allows, I will explore our options for enabling different tracks in an event, and for better supporting events with only one session. Also on my to-do list is letting each schedule slot have materials such as slides, video, code samples etc. Stay tuned.

7 Nov 2011

The Oredev “Lost Session”

This week I'm in Malmo, Sweden for Oredev - looking forward to a great conference.

Wednesday evening, about an hour after the last session for the day, I will be giving a RavenDB session in KAN's offices. There are a few seats available - more details and registration here: http://thelostsession.kan.se/.

If you live nearby, or are attending the conference, we would love to see you there. The evening is free, and there will be food and beers.

16 May 2011

SisoDB: The wrong solution to the wrong problems

Data structures are the cornerstone of computing. If you get them right, you will most probably succeed in your mission of delivering an application that uses its resources wisely and performs well.

In modern computing, most data is stored to and retrieved from databases. Databases are data structures' big brothers - they serve the same purpose but with added value. Choosing one wisely can greatly help you in so many ways; going with the wrong one would cost you too much.

This is why one should not take lightly the decision of which database solution to use.

Dealing with data explosion

Since the 70's, whenever data had to be persisted, RDBMSes were the most effective and trusted tools to use. Once OOP became dominant, developers found it irritating to stuff their hierarchical entities into the flat structure of tables and rows, which is the ABC of RDBMSes. This is how ORMs came to life.

Thinking about it in retrospect, ORMs were never the solution. They just made the problem less irritating. In practice, your data still had to go through an awful lot of processing before it was persisted to, or loaded from, the store. But as long as this was transparent to the developer, who knew that loads of optimization were happening under the hood, it seemed like there was nothing to worry about.

Although the concepts had been known since the 80's, it was not until recent years that real object, document, and graph databases came to life. It took big players like Facebook and Twitter to get those ideas to mature and become production ready. Someone (or a handful of people) realized a shift in thinking was essential, and real-world problems like replication and sharding suddenly seemed a lot less complicated. As a result the NoSQL movement (or whatever it has become) is now full-steam ahead, and data-access best practices are being rewritten.

Each NoSQL brand introduces some cool unique features, never seen in RDBMSes before. Document-oriented databases introduced the "schema-less" concept. That is, unlike in traditional RDBMSes, defining a data schema is no longer required. The DocDB would either figure it out on its own, or it wouldn't even bother to. Data schemas are required in RDBMSes to define the table structure and allow for efficient indexing; DocDBs take a different approach - Map/Reduce.

SisoDB choosing the wrong battle

SisoDB is the new face in town, but it looks like it is choosing the wrong battle. The problems it tries to solve are not real problems. Let me explain.

The SisoDB website explains the motivation behind SisoDB: the need for a real schema-less solution for data storage, while at the same time making sure the powerful tools offered by SQL Server are still available. ORMs are deemed evil because they require mappings, which contradicts the notion of schema-less, and a non-MS-SQL backend is probably deemed irrelevant too. This is probably why there are no providers for Oracle or MySQL.

So, in SisoDB data is now schema-less, but it spans over 3 tables per entity. This is how it looks (taken from the SisoDB site):

And the question arises: if a real schema-less database is what you're after, a direct-POCO-to-storage-and-back-again solution, why would you use SisoDB with SQL Server in the first place? You can just use a NoSQL schema-less database, and if you treasure MSSQL's reporting tools that much, just find a way to still be able to use them! When you choose not to use a NoSQL database, you lose ALL the possible sweet spots such products have to offer, none of which MSSQL offers. And there are so many of them.

Nowadays it doesn't make sense to use SisoDB, either in new development or in existing applications. It may feel schema-less, but its foundations are too deep in the RDBMS world, and it shows. To name a few issues:

  • Deep hierarchies and enumerables are not supported
  • Entity ids ought to be named SisoDb, making it harder to integrate with existing code
  • You can't specify string ids for entities (ids have to be int or Guids)
  • You have to CREATE your databases
  • For every model change you have to tell SisoDB to update the model; it will not be detected automatically, and a schema update is still required.
  • Various common SQL faults, like SELECT N+1 or not batching where possible.
  • Sharding and replication, other strong characteristics of NoSQL databases, are by definition one mile behind.

Some performance numbers were posted by the author comparing SisoDB and other ORMs for inserts. But queries are what you should really care about, and you are going to be disappointed. The most extensive indexing feature SQL has - relations between tables - is not being used in SisoDB by design. SisoDB doesn't define FKs, and doesn't perform JOINs. Put simply, this means that by design SisoDB harms lookup performance, which is hands down the most crucial part of your application. You don't want this.

Just for comparison: RavenDB is a document database written in .NET. It is schema-less too, works with POCOs or raw JSON with no mapping whatsoever, and uses Linq for querying. But it is real NoSQL, and as such it offers much more natural replication and sharding functionality. Other features include full-text search out-of-the-box, entity versioning, REST API, complex and super-fast indexes, embedded mode, Silverlight support, and much more. And RavenDB comes with the ability to replicate its indexes to MSSQL so the reporting tools can still be used even though you're in NoSQL land.

If you were able to convince your bosses to use NoSQL, go with a real NoSQL solution. If not, try again. If you still fail, just keep using your favorite ORM and if mapping annoys you find ways to automate that process instead.
