Sunday, December 29, 2013

Implementing a Treemap in C#

I was wondering how to implement a treemap in C#. A treemap is a data visualisation technique that looks like this



I know a few examples exist online, but between a very abstract paper like this one and a JavaScript implementation on GitHub, I was wondering whether one could write an algorithm from scratch using a naive approach.

Let's start with a simple example: the geometric series 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64… would give us something like this



My guess would be to use a recursive algorithm to create our treemap. First we need to order all the elements from biggest to smallest. Then we start dividing the area, beginning with the first element, which takes half the available space. After that we divide the remaining area among the rest of the elements. We repeat this process until all the pieces are placed.

Now if we take another more realistic example



Here, every time we need to slice the area we need to make sure the slice won't be too small or it won't look great. If the element represents 50% or more of the total it won't be a problem, but what about only 10% or 4%? I think we should set a minimum threshold for our slice, let's say 25% for now. We will experiment with this value once we finish our algorithm.

So what happens if the largest value is below our 25% threshold? I think we should include more elements in the slice until we reach at least 25%. For example, if we have 14%, 8% and 5% the total is 27%, so our first slice will be 27% of the available space. Then we need to distribute the 3 elements in that slice. This is in essence a subset of our original problem, so we can repeat the process just for that slice.

By the way, what orientation should the first slice be? I think we should always slice on the longest side of the area rectangle. If we have a square then it doesn't really matter.

Next, how do we divide the first slice? We have 3 elements: 14%, 8% and 5%. Treating the slice as a new problem from scratch, the proportions become 51.8%, 29.6% and 18.5%. We can take a new slice for just the first element, then repeat the process for the last 2 elements, which become 61.5% and 38.5%. At each step we need to evaluate whether the slice will be horizontal or vertical depending on the shape of the area we have.

Finally, all we have to do is repeat this until all the elements are placed!

Here is my implementation in LinqPad in 3 parts

Slice calculation
public Slice<T> GetSlice<T>(IEnumerable<Element<T>> elements, double totalSize, 
 double sliceWidth)
{
 if (!elements.Any()) return null;
 if (elements.Count() == 1) return new Slice<T> 
  { Elements = elements, Size = totalSize };
 
 var sliceResult = GetElementsForSlice(elements, sliceWidth);
 
 return new Slice<T>
 {
  Elements = elements,
  Size = totalSize,
  SubSlices = new[]
  { 
   GetSlice(sliceResult.Elements, sliceResult.ElementsSize, sliceWidth),
   GetSlice(sliceResult.RemainingElements, 1 - sliceResult.ElementsSize, 
    sliceWidth)
  }
 };
}

private SliceResult<T> GetElementsForSlice<T>(IEnumerable<Element<T>> elements,
 double sliceWidth)
{
 var elementsInSlice = new List<Element<T>>();
 var remainingElements = new List<Element<T>>();
 double current = 0;
 double total = elements.Sum(x => x.Value);
 
 foreach (var element in elements)
 {
  if (current > sliceWidth)
   remainingElements.Add(element);
  else
  {
   elementsInSlice.Add(element);
   current += element.Value / total;
  }
 }
 
 return new SliceResult<T> 
 { 
  Elements = elementsInSlice, 
  ElementsSize = current,
  RemainingElements = remainingElements
 };
}

public class SliceResult<T>
{
 public IEnumerable<Element<T>> Elements { get; set; }
 public double ElementsSize { get; set; }
 public IEnumerable<Element<T>> RemainingElements { get; set; }
}

public class Slice<T>
{
 public double Size { get; set; }
 public IEnumerable<Element<T>> Elements { get; set; }
 public IEnumerable<Slice<T>> SubSlices { get; set; }
}

public class Element<T>
{
 public T Object { get; set; }
 public double Value { get; set; }
}

Generating rectangles from leaf slices (slices with only one element in them)
public IEnumerable<SliceRectangle<T>> GetRectangles<T>(Slice<T> slice, int width, 
 int height)
{
 var area = new SliceRectangle<T>
  { Slice = slice, Width = width, Height = height };
 
 foreach (var rect in GetRectangles(area))
 {
  // Make sure no rectangle goes outside the original area
  if (rect.X + rect.Width > area.Width) rect.Width = area.Width - rect.X;
  if (rect.Y + rect.Height > area.Height) rect.Height = area.Height - rect.Y;
  
  yield return rect;
 }
}

private IEnumerable<SliceRectangle<T>> GetRectangles<T>(
 SliceRectangle<T> sliceRectangle)
{
 var isHorizontalSplit = sliceRectangle.Width >= sliceRectangle.Height;
 var currentPos = 0;
 foreach (var subSlice in sliceRectangle.Slice.SubSlices)
 {
  var subRect = new SliceRectangle<T> { Slice = subSlice };
  int rectSize;
  
  if (isHorizontalSplit)
  {
   rectSize = (int)Math.Round(sliceRectangle.Width * subSlice.Size);
   subRect.X = sliceRectangle.X + currentPos;
   subRect.Y = sliceRectangle.Y;
   subRect.Width = rectSize;
   subRect.Height = sliceRectangle.Height;
  }
  else
  {
   rectSize = (int)Math.Round(sliceRectangle.Height * subSlice.Size);
   subRect.X = sliceRectangle.X;
   subRect.Y = sliceRectangle.Y + currentPos;
   subRect.Width = sliceRectangle.Width;
   subRect.Height = rectSize;
  }
  
  currentPos += rectSize;
  
  if (subSlice.Elements.Count() > 1)
  {
   foreach (var sr in GetRectangles(subRect))
    yield return sr;
  }
  else if (subSlice.Elements.Count() == 1)
   yield return subRect;
 }
}

public class SliceRectangle<T>
{
 public Slice<T> Slice { get; set; }
 public int X { get; set; }
 public int Y { get; set; }
 public int Width { get; set; }
 public int Height { get; set; }
}

Drawing the rectangles in WinForms
public void DrawTreemap<T>(IEnumerable<SliceRectangle<T>> rectangles, int width, 
 int height)
{
 var font = new Font("Arial", 8);

 var bmp = new Bitmap(width, height);
 var gfx = Graphics.FromImage(bmp);
 
 gfx.FillRectangle(Brushes.Blue, new RectangleF(0, 0, width, height));

 foreach (var r in rectangles)
 {
  gfx.DrawRectangle(Pens.Black, 
   new Rectangle(r.X, r.Y, r.Width - 1, r.Height - 1));

  gfx.DrawString(r.Slice.Elements.First().Object.ToString(), font, 
   Brushes.White, r.X, r.Y);
 }

 var form = new Form() { AutoSize = true };
 form.Controls.Add(new PictureBox()
  { Width = width, Height = height, Image = bmp });
 form.ShowDialog();
}

And finally to generate a Treemap in LinqPad
void Main()
{
 const int Width = 400;
 const int Height = 300;
 const double MinSliceRatio = 0.35;

 var elements = new[] { 24, 45, 32, 87, 34, 58, 10, 4, 5, 9, 52, 34 }
  .Select (x => new Element<string> { Object = x.ToString(), Value = x })
  .OrderByDescending (x => x.Value)
  .ToList();

 var slice = GetSlice(elements, 1, MinSliceRatio).Dump("Slices");
 
 var rectangles = GetRectangles(slice, Width, Height)
  .ToList().Dump("Rectangles");
 
 DrawTreemap(rectangles, Width, Height);
}
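
As a quick sanity check, tracing GetSlice and GetRectangles by hand for three values gives the expected layout below (a sketch of what I'd run in LinqPad, not actual program output):

var demo = new[] { 50, 30, 20 }
 .Select (x => new Element<string> { Object = x.ToString(), Value = x })
 .OrderByDescending (x => x.Value)
 .ToList();

var demoSlice = GetSlice(demo, 1, 0.35);
var demoRectangles = GetRectangles(demoSlice, 400, 300).ToList();

// Expected layout in a 400x300 area:
// "50" fills the left half (0, 0, 200x300),
// "30" the upper right (200, 0, 200x180),
// "20" the lower right (200, 180, 200x120).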

Wednesday, August 28, 2013

Windows Azure Caching and transient faults

When using remote services over the wire we should always plan for transient failures. Windows Azure Caching, like any service in the Azure world, is prone to such problems. Out of the box, calls to the cache server will fail from time to time due to network issues. Typically you will get these kinds of exceptions:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later.
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server.

For that reason it is a best practice to implement some kind of retry logic around the code calling the cache server. We could have used the Transient Fault Handling Application Block to manage that. But a few months ago I read somewhere that from time to time the DataCache object loses its internal connection to the cache server. A simple way to fix this is to re-create the DataCache instance and retry the operation.
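
For reference, here is a minimal sketch of what the Transient Fault Handling Application Block route could look like. The CacheErrorDetectionStrategy class and the retry values are only an illustration (the block itself provides RetryPolicy, the retry strategies and the ITransientErrorDetectionStrategy interface), and cache is assumed to be a DataCache instance:

// Hypothetical detection strategy: treat any DataCacheException as transient.
public class CacheErrorDetectionStrategy : ITransientErrorDetectionStrategy
{
    public bool IsTransient(Exception ex)
    {
        return ex is DataCacheException;
    }
}

// Retry up to 3 times, waiting 1 second between attempts.
var retryPolicy = new RetryPolicy<CacheErrorDetectionStrategy>(
    new FixedInterval(3, TimeSpan.FromSeconds(1)));

var value = retryPolicy.ExecuteAction(() => cache.Get("some-key"));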

In the implementation below I'm keeping a reference to the DataCacheFactory and DataCache objects (another best practice). The CreateDataCache factory method will come in handy later.

public class CachingService
{
    private DataCacheFactory cacheFactory;
    private DataCache cache;
   
    private DataCache Cache
    {
        get
        {
            if (this.cache == null)
            {
                this.CreateDataCache();
            }

            return this.cache;
        }
    }

    private void CreateDataCache()
    {
        this.cacheFactory = new DataCacheFactory();
        this.cache = this.cacheFactory.GetDefaultCache();
    }

    // ...
}

Then I have this SafeCallFunction method that I use whenever I want to work with the DataCache object. Notice that the only thing I do to retry the operation is call the factory method to re-create the DataCache object.

private object SafeCallFunction(Func<object> function)
{
    try
    {
        return function.Invoke();
    }
    catch (DataCacheException)
    {
        // Retry by first re-creating the DataCache
        try
        {
            this.CreateDataCache();
            return function.Invoke();
        }
        catch (DataCacheException)
        {
            // Log error
        }
    }

    return null;
}

Finally in the rest of the class I can use the SafeCallFunction like this
public object CacheGet(string key)
{
    return this.SafeCallFunction(() => this.Cache.Get(key));
}

public void CachePut(string key, object cacheObject)
{
    this.SafeCallFunction(() => this.Cache.Put(key, cacheObject));
}

public void CacheRemove(string key)
{
    this.SafeCallFunction(() => this.Cache.Remove(key));
}

So far, after a few weeks of using this implementation, the single retry has never failed on us. Before that we had around 5-10 failures daily for about 500k calls to the cache server. I would still recommend using a more robust retry policy with Windows Azure Caching, but I think it's interesting to know that simply instantiating a new DataCache can fix most failures.

References

Caching in Windows Azure
Best Practices for using Windows Azure Cache
Optimization Guidance for Windows Azure Caching
The Transient Fault Handling Application Block

Wednesday, July 31, 2013

Using Windows Azure Caching efficiently across multiple Cloud Service roles

Windows Azure Caching is a great way to improve the performance of your Azure application at no additional cost. The cache runs alongside your application in Cloud Service roles. The only thing you need to decide is how much of the role's memory you want to use for it (for a co-located cache role). You can also dedicate all the memory of a role to caching if you want (with a dedicated cache role). Roles that host caching are called cache clusters.

Getting started with Azure Caching is so easy that it can be a while before you fully understand the best way to use it. On a recent project my first thought was to enable caching on all the Cloud Service roles as a co-located service. This caused us problems.

First of all, being a developer, I debug my application using the local Azure compute emulator. The emulator runs one cache service for each instance of a role with a cache cluster. The application has 2 web roles and 1 worker role, so when I start a debugging session with multiple instances per role I need a lot of memory to run everything. More importantly, cache clusters do not share cached data with each other. This caused us to have stale data in the application.

That is when I figured out that I needed to read a bit more on Azure Caching if I was to use it efficiently.

Understanding cache clusters


When you enable caching on an Azure role, each instance of that role will run a cache service using a portion of its memory (or all of it if it's a dedicated cache role). The cache services running on the instances of a single role are managed as a single cache cluster. Cache services can talk to each other and synchronize data, but only inside the same cache cluster (same role). That is why enabling caching on many roles might not be the best thing to do.


Another thing to mention is that cache clusters can only be created on Small role instances or bigger. The reason is that with an Extra Small instance you only get 768 MB of RAM, which is pretty much all used up by anything you run on those instances.

Now, to enable a cache cluster on your role, go to the role property page, on the Caching tab.


Here you will notice that I also enabled notifications which will allow us to efficiently use local caches later.

For more information on the different configuration options for cache clusters go here.

Configuring roles to use cache clients


Now that we have taken care of the server side of the caching configuration, let's talk about the client side. Every instance of every role inside the same cloud deployment can connect to a cache cluster. If you run only one cluster then you are guaranteed to access the same cached data from whatever role you are in inside your application (as long as you have a valid configuration).


One nice feature we can enable in each role configuration is the local cache client. With this, a role instance keeps the data it recently fetched from the cache cluster in its own memory for even faster access. Remember the Notification option we enabled on the server side? Using the corresponding local cache configuration in the Web.config or App.config of your role will ensure data stored in the local cache client gets updated whenever the cache server version of that data changes. Basically, the local cache client will invalidate data based on notifications received from the cache cluster.


For more information on client side configuration go here.

Other concerns


This post is only an overview of the considerations of running multiple cache clusters versus a single one. There are a lot more Azure Caching configuration options you should take a look at here. Also really important is how you use Azure Caching in your application.

Conclusions


I've spent a lot of time figuring out how all of this works. I hope this post will help you with your own learning.


Monday, July 22, 2013

Handling Azure Storage Queue poison messages

This post will talk about what to do now that we are handling poison messages in our Azure Storage Queues.

First, let's review what we've done so far.


Messages that continuously fail to process will end up in the error queue. Someone asked me why we need error queues at all; we could simply log the errors and delete the message, right? Well, if you have a really efficient and proactive DevOps team, I suppose logging errors along with the original messages ought to be enough.
Someone will review why the message failed, and if it was only a transient error they could send the original message to the queue again.

We could also store failed messages into an Azure Storage Table.
Then we could simply monitor new entries in this table and act on them. Again, the original message should be stored in the table so we can send it again if we choose.
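
Here is a minimal sketch of that option using the same storage client API as the queue code; the table name, the property names and the message/exception variables are made up for the example:

var tableClient = storageAccount.CreateCloudTableClient();
var errorTable = tableClient.GetTableReference("taskfailures");
errorTable.CreateIfNotExists();

var entity = new DynamicTableEntity("task", Guid.NewGuid().ToString());
entity.Properties["OriginalMessage"] =
    EntityProperty.GeneratePropertyForString(message.AsString);
entity.Properties["Error"] =
    EntityProperty.GeneratePropertyForString(exception.ToString());

errorTable.Execute(TableOperation.Insert(entity));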

I think the best reason to use a queue for error messages is if you want an administrative tool to monitor, review and re-send messages. In this case the queue mechanics let you handle those messages like any other queue-based process.

For me, one of the unpleasant side effects of using error queues is that all the queues in my system are now multiplied by two (one normal and one error queue). It's not too bad if your naming scheme is consistent, but even then, if you do operational work using a tool like Cerebrata Azure Management Studio or even Visual Studio's Server Explorer, you will feel overwhelmed by the quantity of queues.

Managing queue messages with Cerebrata Azure Management Studio
Managing queue messages with Visual Studio Server Explorer

Whatever you do, I suggest you always at least log failures properly. Later, while debugging the issue, you will be thankful to be able to easily match the failure logs with the message that caused them.

Friday, June 21, 2013

Windows Azure Storage Queue with error queues

Windows Azure Storage Queues are an excellent way to execute tasks asynchronously in Azure. For example, a web application could leave all the heavy processing to a Worker Role instead of doing it itself; that way requests will complete faster. Queues can also be used to decouple communication between two separate applications or two components of the same application.

One problem with asynchronous processing is deciding what to do when an operation fails. With synchronous patterns like a direct call we usually return an error code or a message, or throw an exception. When we use queues to delegate the execution to another process we can't notify the originator directly. One thing we can do is send a message back to the originator through another queue, like a callback. This is interesting when the process normally produces a result; an error in this case is just another kind of result.
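
As a quick sketch of that callback idea (the response queue name and the originalMessage variable are made up; queueClient is the CloudQueueClient created in the initialization code further down):

var responseQueueReference = queueClient.GetQueueReference("task-response");
responseQueueReference.CreateIfNotExists();

// The worker reports the outcome and the originator polls this queue for results.
responseQueueReference.AddMessage(
    new CloudQueueMessage("Completed: " + originalMessage.AsString));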


Another way is to create an error queue (or dead letter queue) for poison messages, where we put all the messages that failed processing. This way we get a list of all the failing messages so we can review them, find out what the problem was and figure out what to do about it. For example, we can retry a message by moving it back to the main queue so it can be processed again.


Now, let see how we can implement an error queue using Windows Azure Storage Queue.

Implementation of an error queue


First we will initialize the queues. For each queue we also create a '<queuename>-error' queue.
var storageAccount = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
var queueClient = storageAccount.CreateCloudQueueClient();

this.taskQueueReference = queueClient.GetQueueReference("task");
this.taskErrorQueueReference = queueClient.GetQueueReference("task-error");
 
this.taskQueueReference.CreateIfNotExists();
this.taskErrorQueueReference.CreateIfNotExists();

Next we add a few messages, including one that will cause the processing to fail (to simulate failures):
this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Message " + DateTime.UtcNow.Ticks));

this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Message " + DateTime.UtcNow.Ticks));

this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Error " + DateTime.UtcNow.Ticks));

Finally, the code that actually polls the queue for messages. Usually polling is done in an infinite loop, but when no message is fetched it is good practice to wait a while before polling again to prevent unnecessary transaction costs and IO (each call to GetMessages is 1 transaction). Depending on how quickly the queue needs to react to new messages, this may go from a few seconds for critical tasks to a few minutes for non-critical tasks. Also, I'm using a retry mechanism here, meaning that I'll try to process a message a few times before I really consider it in error (poison). If we don't delete a message after fetching it, then after some time it goes back in the queue to be processed again. This means all tasks we want to process using queues should be idempotent.
private void PollQueue()
{
 IEnumerable<CloudQueueMessage> messages;
 
 do
 {
  messages = this.taskQueueReference
   .GetMessages(8, visibilityTimeout: TimeSpan.FromSeconds(10));

  foreach (var message in messages)
  {
   bool result = false;
   try
   {         
    result = this.ProcessMessage(message);
    
    if (result) this.taskQueueReference.DeleteMessage(message);
   }
   catch (Exception ex)
   {
    this.Log(message.AsString, ex);
   }
   
   if (!result && message.DequeueCount >= 3)
   {
    this.taskErrorQueueReference.AddMessage(message);
    this.taskQueueReference.DeleteMessage(message);
   }
  }
 } while (messages.Any());
}

private bool ProcessMessage(CloudQueueMessage message)
{
 if (message.AsString.StartsWith("Error")) throw new Exception("Error!");
 
 return true;
}

First, I'm fetching messages in batches of 8 in this case. In one transaction you can fetch between 1 and 32 messages. Also, I set the visibilityTimeout to 10 seconds, which means the messages won't be visible to anyone else during that time. Usually you want to set the visibility timeout based on how much time is required to process all the messages of the batch. If we don't have the time to delete the messages from the queue before the timeout elapses, another worker could fetch the messages and start processing them again. So we should balance the time to process all the messages in one batch with how much time we want to allow between retries.

Next we process the message. If the processing is successful we simply return true so the message can be deleted from the queue. If processing failed we have two options: return false or throw an exception. Most of the time I simply return false instead of throwing an exception when the failure is expected.

Finally, we check how many times we unsuccessfully tried to process the message and if we reached our limit (in this case 3 times). If we did then it's time to send that message to the error queue and delete it from the normal queue.

Next time we will look at how we want to handle the messages in the error queue. You can find this post here.

Thursday, May 9, 2013

Using Azure Blob Storage to store documents

Last time, in Implementing a document oriented database with the Windows Azure Table Storage Service, I was using the Table Storage Service to store serialized documents in an entity's property. While it is an easy way to store complex objects, table storage is usually meant to store primitives like int, bool, date and simple strings. However, there is another service in the Windows Azure Storage family that is better suited to storing documents: the Blob Storage Service. Blob storage uses the metaphor of files, which are in essence documents.

Now I will adapt the Repository from my previous post to use blob storage this time. I'll only walk through the changes I'm making here.

Constructor


First, we need to create a CloudBlobContainer instance in the constructor. Please note that blob storage container names are required to be lowercase.

public class ProjectRepository
{
    private CloudTable table;
    private CloudBlobContainer container;
    
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var tableClient = storageAccount.CreateCloudTableClient();
        this.table = tableClient.GetTableReference("Project");
        this.table.CreateIfNotExists();
        
        var blobClient = storageAccount.CreateCloudBlobClient();
        this.container = blobClient.GetContainerReference("project");
        this.container.CreateIfNotExists();
    }
    // ...
}

Insert


Next, for the Insert method, we no longer store the document in a property of the ElasticTableEntity object. Instead we serialize the document to JSON, upload it as a file to blob storage and set the ContentType of that file to application/json. For the blob name (or path) the pattern I'm using looks like this: {document-type}/{partition-key}/{row-key}.

public void Insert(Project project)
{
    project.Id = Guid.NewGuid();
        
    var document = JsonConvert.SerializeObject(project,
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
  
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
  
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks.Count();
  
    this.table.Execute(TableOperation.Insert(entity));
}

private void UploadDocument(string partitionKey, string rowKey, string document)
{
    var filename = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(filename);
  
    using (var memory = new MemoryStream())
    using (var writer = new StreamWriter(memory))
    {
        writer.Write(document);
        writer.Flush();
        memory.Seek(0, SeekOrigin.Begin);
   
        blockBlob.UploadFromStream(memory);
    }
  
    blockBlob.Properties.ContentType = "application/json";
    blockBlob.SetProperties();
}

Load


For the Load method we can get the blob name using the PartitionKey and RowKey then download the document from blob storage. In DownloadDocument I'm using a MemoryStream and StreamReader to get the serialized document as a string.

public Project Load(string partitionKey, string rowKey)
{
    var blobName = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var document = this.DownloadDocument(blobName);
    return JsonConvert.DeserializeObject<Project>(document);
}

private string DownloadDocument(string blobName)
{
    var blockBlob = this.container.GetBlockBlobReference(blobName);
  
    using (var memory = new MemoryStream())
    using (var reader = new StreamReader(memory))
    {
        blockBlob.DownloadToStream(memory);
        memory.Seek(0, SeekOrigin.Begin);
   
        return reader.ReadToEnd();
    }
}

List


In the first List method we want to get all the documents of the same partition. We can do that directly with the ListBlobs method of CloudBlobDirectory. For the ListWithTasks method we still need to query table storage first to know which documents contain at least one task. Then, from the entities, we know the RowKey values of those documents, so we can simply call the Load method we just saw.

public IEnumerable<Project> List(string partitionKey)
{
    var listItems = this.container
        .GetDirectoryReference("project/" + partitionKey).ListBlobs();
  
    return listItems.OfType<CloudBlockBlob>()
        .Select(x => this.DownloadDocument(x.Name))
        .Select(document => JsonConvert.DeserializeObject<Project>(document));
}
    
public IEnumerable<Project> ListWithTasks(string partitionKey)
{
    var query = new TableQuery<ElasticTableEntity>()
        .Select(new [] { "RowKey" })
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                QueryComparisons.GreaterThan, 0)));
       
    dynamic entities = table.ExecuteQuery(query).ToList();
 
    foreach (var entity in entities)
        yield return this.Load(partitionKey, entity.RowKey);
}

Update


To update a document now we also need to serialize and upload the new version to blob storage.

public void Update(Project project)
{
    var document = JsonConvert.SerializeObject(project, 
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
        
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks != null ? project.Tasks.Count() : 0;
        
    this.table.Execute(TableOperation.Replace(entity));
}

Delete


Finally, deleting a document now also requires us to get the block blob reference from the container and call Delete on it.

public void Delete(Project project)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = project.Owner.ToString();
    entity.RowKey = project.Id.ToString();
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
    
    this.DeleteDocument(entity.PartitionKey, entity.RowKey);
}
 
public void Delete(string partitionKey, string rowKey)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
 
    this.DeleteDocument(partitionKey, rowKey);
}

private void DeleteDocument(string partitionKey, string rowKey)
{
    var blobName = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(blobName);
    blockBlob.Delete(DeleteSnapshotsOption.IncludeSnapshots);
}

Conclusion


Using both the Table and Blob Storage Services we can get the best of both worlds: we can query for a document's properties with table storage and we can store documents larger than 64KB in blob storage. Of course, almost all operations on my Repository now require two calls to Azure. Currently those are done sequentially, waiting for the first call to complete before doing the second one. I should fix that by using the asynchronous variants of the storage service methods, like the BeginDelete/EndDelete method pair on CloudBlockBlob.
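
For example, here is a minimal sketch of an asynchronous delete wrapping that Begin/End pair with Task.Factory.FromAsync (my illustration, not part of the repository above):

private Task DeleteDocumentAsync(string partitionKey, string rowKey)
{
    var blobName = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(blobName);

    // Wraps the APM pair exposed by the storage client into a Task.
    return Task.Factory.FromAsync(blockBlob.BeginDelete, blockBlob.EndDelete, null);
}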

I hope this post is giving you ideas on new and clever ways you can use the Windows Azure Storage Services in your projects.

See also

- Using Azure Table Storage with dynamic table entities
- Document oriented database with Azure Table Storage Service

Tuesday, April 9, 2013

Document oriented database with Azure Table Storage Service

What is a Document database?


A document store or document oriented database like RavenDB is a kind of NoSQL database where we store semi structured data in documents and use a key to retrieve existing documents.

The difference with a relational database is that a complex data structure needs to be represented by entities and relations in an RDBMS, which means using many tables, joins and constraints. In a document database the whole graph of entities is stored as a single document. Of course we still need to handle some relations between documents or graphs of entities. This is done by storing other documents' keys inside the document, acting like foreign keys.

Another way to see documents is to think that all tables related together with delete cascade constraints on the relations are part of the same document. If a piece of data from a table can only exist if related data from another table also exists, it means both should be part of the same document.
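
To illustrate with a made-up example, an Invoice document would embed its lines because they cannot exist without it, while the customer is only referenced by its key:

public class Invoice
{
    public Guid Id { get; set; }
    public Guid CustomerId { get; set; } // reference to another document, like a foreign key
    public List<InvoiceLine> Lines { get; set; } // embedded, deleted with the invoice
}

public class InvoiceLine
{
    public string Description { get; set; }
    public decimal Amount { get; set; }
}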

The concept of a document relates well to Aggregates in Domain-Driven Design.

Implementing a document database with the Azure Table Storage Service


Looking at RavenDB, I was wondering if it was possible to use similar patterns but with the Windows Azure Table Storage Service instead of the file system that RavenDB uses.

A good storage format for our documents is JSON. Using a serialization library like Json.Net it is easy to convert our data to JSON.

Using the Table Storage Service also requires us to provide a PartitionKey for our documents; in a typical multi-tenant database we could use this to 'partition' the data per tenant.

A document in a document store is easy to retrieve when we know the key and can fetch it directly, but sometimes we don't have that information. We might also want to query documents using a filter expression. In a relational database filtering in a query is easy, but in a document store it requires a bit more effort. RavenDB lets us define indexes we can use to filter documents in queries. With the Table Storage Service we can use additional properties to store the information we want to filter on, acting like indexes.

In order to create a very lightweight Document Store with the Table Storage Service I will use my ElasticTableEntity class from a previous post.

First let me show you my domain entities for this demo: a simple Project class which may have many Tasks associated with it.

public class Project
{
    public Guid Owner { get; set; }
    public Guid Id { get; set; }
    public string Name { get; set; }
    public DateTime StartDate { get; set; }
    public int Status { get; set; }
    public List<Task> Tasks { get; set; }
}
 
public class Task
{
    public string Name { get; set; }
    public bool IsCompleted { get; set; }
}

Now let's take a look at a typical Repository implementation for the Projects. You will need both the WindowsAzure.Storage and Newtonsoft.Json packages from NuGet for this part.

public class ProjectRepository
{
    private CloudTable table;
 
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var client = storageAccount.CreateCloudTableClient();
        this.table = client.GetTableReference("Project");
        this.table.CreateIfNotExists();
    }
 
    public void Insert(Project project)
    {
        project.Id = Guid.NewGuid();
        
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        
        entity.Document = JsonConvert.SerializeObject(project, 
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Insert(entity));
    }
 
    public IEnumerable<Project> List(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }
 
    public IEnumerable<Project> ListWithTasks(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                    QueryComparisons.GreaterThan, 0)));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }

    public Project Load(string partitionKey, string rowKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("RowKey", 
                    QueryComparisons.Equal, rowKey)));
        
        dynamic entity = table.ExecuteQuery(query).SingleOrDefault();
        if (entity != null)
        {
            var document = (string)entity.Document.StringValue;
            return JsonConvert.DeserializeObject<Project>(document);
        }
        
        return null;
    }
 
    public void Update(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        entity.Document = JsonConvert.SerializeObject(project,
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Replace(entity));
    }
 
    public void Delete(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
 
    public void Delete(string partitionKey, string rowKey)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = partitionKey;
        entity.RowKey = rowKey;
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
}

We could refactor the code to reduce duplication, but the point was to show you how to dynamically create a Document property to store the actual serialized document and how to handle the basic CRUD operations. You can also see how to dynamically add other properties to store extra information about the document. This is useful for the ListWithTasks method, which fetches only the Projects with at least one Task.

Finally let's take a look at a few examples on how to use the ProjectRepository itself (in LinqPad in this case)...

private void Insert()
{
    var repo = new ProjectRepository();
    
    var project = new Project()
    {
        Owner = Guid.Parse("8ad82668-4b08-49c9-87ef-80870bfb4b85"),
        Name = "My new project",
        StartDate = DateTime.Now,
        Status = 4,
        Tasks = new List<Task>()
        { 
            new Task { Name = "Task 1", IsCompleted = true }, 
            new Task { Name = "Task 2" } 
        }
    };
    
    repo.Insert(project);
}
 
private void List()
{
    var repo = new ProjectRepository();
    
    var projects = repo.List("8ad82668-4b08-49c9-87ef-80870bfb4b85");
    projects.Dump();
}
 
private void Load()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");

    project.Dump();
}
 
private void Update()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    
    project.Name = "Modified name " + DateTime.Now.Ticks;
    repo.Update(project);
}
 
private void Delete()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    repo.Delete(project);
}
 
private void DeleteDirectly()
{
    var repo = new ProjectRepository();
    repo.Delete("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
}

What I've shown you here is really basic and we do have some limitations like the fact that a serialized document can't be bigger than 64KB in size and we are limited to 251 extra properties (or indexes). Still, it is a good start for a prototype of a document store.

I'm currently working on a more self-contained library to help me use the Azure Table Storage Service as a Document Store. More on this in a future post.

If you want you can grab all the code (Gist) on GitHub here and here.

See also

- Using Azure Table Storage with dynamic table entities
- Using Azure Blob Storage to store documents

Tuesday, March 12, 2013

Using Azure Table Storage with dynamic table entities

I've been working with Windows Azure for a few months now and I was trying to figure out a way to use the Azure Table Storage Service with POCOs and complex types rather than only classes inheriting from the TableEntity base class. It turns out that the only thing the CloudTableClient cares about is the ITableEntity interface. DynamicTableEntity also implements ITableEntity but is only used for querying and updating entities. You can see it in action in the example on how to query only a subset of an entity's properties.

So I started to wonder if it was possible to create a class that implements ITableEntity and offers the dynamic features of an ExpandoObject. After a bit of hacking around in LinqPad I came up with this solution.
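
The original snippet was posted as an image; here is a condensed sketch of the idea (without the ICustomMemberProvider part mentioned below): a DynamicObject that implements ITableEntity and keeps its values in a dictionary of EntityProperty.

public class ElasticTableEntity : DynamicObject, ITableEntity
{
    public ElasticTableEntity()
    {
        this.Properties = new Dictionary<string, EntityProperty>();
    }

    public IDictionary<string, EntityProperty> Properties { get; private set; }

    // ITableEntity members
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public DateTimeOffset Timestamp { get; set; }
    public string ETag { get; set; }

    // Indexer so we can write entity["LastName"] = "Laurin"
    public object this[string key]
    {
        get { return this.Properties[key]; }
        set
        {
            // With older storage client versions, use the GeneratePropertyFor* methods instead.
            this.Properties[key] = EntityProperty.CreateEntityPropertyFromObject(value);
        }
    }

    // Dynamic members map to the Properties dictionary; reading a member returns
    // the EntityProperty wrapper (hence entity.Document.StringValue in later posts).
    public override bool TryGetMember(GetMemberBinder binder, out object result)
    {
        result = this.Properties[binder.Name];
        return true;
    }

    public override bool TrySetMember(SetMemberBinder binder, object value)
    {
        this[binder.Name] = value;
        return true;
    }

    public void ReadEntity(IDictionary<string, EntityProperty> properties,
        OperationContext operationContext)
    {
        this.Properties = properties;
    }

    public IDictionary<string, EntityProperty> WriteEntity(OperationContext operationContext)
    {
        return this.Properties;
    }
}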



In this snippet I also implemented ICustomMemberProvider, which is part of the LinqPad extensions API for queries (more on this here). In Visual Studio we'll need to remove that code.

We can now use the ElasticTableEntity class like this:
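
The original usage snippet was also an image; it probably looked something like this (the table variable is a regular CloudTable and the row key value is illustrative):

dynamic entity = new ElasticTableEntity();
entity.PartitionKey = "Partition1";
entity.RowKey = (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString();

entity.FirstName = "Pascal";
entity.Number = 34;
entity.Bool = false;
entity.Date = new DateTime(1912, 3, 4, 0, 0, 0, DateTimeKind.Utc);
entity.TokenId = Guid.NewGuid();
entity["LastName"] = "Laurin";

table.Execute(TableOperation.Insert(entity));

// Querying back only a couple of properties gives the projected result shown below.
var query = new TableQuery<ElasticTableEntity>().Select(new[] { "Date", "FirstName" });
var results = table.ExecuteQuery(query).ToList();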



Please note that you need to use the dynamic keyword to be able to define properties dynamically. You can also use the entity's indexer, like I did with the LastName property.

Result

List<ElasticTableEntity> (1 item)
  PartitionKey: Partition1
  RowKey: 232520391787589766073
  Timestamp: 2013-03-13 1:00:40 AM +00:00
  ETag: W/"datetime'2013-03-13T01%3A00%3A40.619873Z'"
  FirstName: Pascal
  Number: 34
  Bool: False
  Date: 1912-03-04 12:00:00 AM +00:00
  TokenId: 50604c02-f01c-48fc-862e-7ea66153f434
  LastName: Laurin

Result with projection

List<ElasticTableEntity> (1 item)
  PartitionKey: Partition1
  RowKey: 232520391787589766073
  Timestamp: 2013-03-13 1:00:40 AM +00:00
  ETag: W/"datetime'2013-03-13T01%3A00%3A40.619873Z'"
  Date: 1912-03-04 12:00:00 AM +00:00
  FirstName: Pascal

The ElasticTableEntity allows us to define properties at run time which will be added to the table when inserting the entities. Tables in Azure Table Storage have a flexible schema, so we are free to store entities with different properties as long as we respect some limitations:
  • An entity can have no more than 252 custom properties (plus the PartitionKey, RowKey and Timestamp system properties)
  • An entity's data can be up to 1 MB in size
  • A property must be one of the following types: byte[], bool, DateTime, double, Guid, int, long or string
  • A property value can be up to 64 KB in size (for string and byte array)
  • A property name is case sensitive and can be no more than 255 characters in length
You can store about any kind of data as long as it is one of the supported data types. You could also encode other kinds of data in a byte array or a string (like a JSON document). Just be careful to always stick to one data type per property (although, yes, we can store an int, a bool and a string in the same column using different entities!)

That's it for now. Next time I'll show you how to use the Windows Azure Table Storage Service as a document-oriented database with the ElasticTableEntity.



See also

- Document oriented database with Azure Table Storage Service
- Using Azure Blob Storage to store documents

Sunday, February 3, 2013

Using MiniProfiler for Entity Framework

The most important thing to know when using an ORM for your database querying is what exactly happens behind the scenes. Do you know if what you are doing will produce acceptable SQL? Do you know if the query will execute fast? Worst of all, do you know how many queries will be executed?

To help answer those questions we need to profile our ORM. Good tools like SQL Profiler, Entity Framework Profiler and NHibernate Profiler exist, but you need to spend money on them.

On the free side of things, MiniProfiler is a library we can add to our project via a NuGet package, for Entity Framework in this case. MiniProfiler was created with ASP.NET MVC applications in mind, but we can still use it in a desktop application with another library called MiniProfiler.Windows.

Here is a small example on how to use both libraries together:




This will output to the console the duration of the execution and the raw SQL and parameters of the query. Of course we'll want to refactor this code a bit, but it gives you an idea of how MiniProfiler works.

Just a small warning: for reasons unknown to me, mixing MiniProfiler with Entity Framework's database initializers like DropCreateDatabaseAlways does not work. As long as we disable initialization with Database.SetInitializer<Context>(null), everything works just fine.