Tuesday, April 9, 2013

Document oriented database with Azure Table Storage Service

What is a Document database?


A document store, or document-oriented database, like RavenDB is a kind of NoSQL database where we store semi-structured data as documents and use a key to retrieve existing documents.

The difference with a relational database is that a complex data structure needs to be represented by entities and relations in an RDBMS, which means using many tables, joins and constraints. In a document database the whole graph of entities is stored as a single document. Of course we still need to handle some relations between documents or graphs of entities. This is done by storing other documents' keys inside the document, acting like foreign keys.

Another way to see documents is to think that all tables related together with a cascade-delete constraint on the relations are part of the same document. If a piece of data from a table can only exist when related data from another table also exists, it means both should be part of the same document.

The concept of a document relates well to Aggregates in Domain Driven Design.
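For example, a Project aggregate with its embedded Tasks could be stored as one JSON document like the sketch below; the owner is kept as a plain key referencing another document, foreign-key style. (The field names and values here are only illustrative, matching the demo classes shown further down.)

```json
{
  "Owner": "8ad82668-4b08-49c9-87ef-80870bfb4b85",
  "Id": "c7d5f59c-72da-48de-83ca-265d8609ec02",
  "Name": "My new project",
  "StartDate": "2013-04-09T10:00:00",
  "Status": 4,
  "Tasks": [
    { "Name": "Task 1", "IsCompleted": true },
    { "Name": "Task 2", "IsCompleted": false }
  ]
}
```

Deleting this document removes the Tasks with it, which is exactly the cascade-delete behavior described above.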

Implementing a document database with the Azure Table Storage Service


Looking at RavenDB, I was wondering if it was possible to use similar patterns but with the Windows Azure Table Storage Service as the underlying store instead of the file system that RavenDB uses.

A good storage format for our documents is JSON. Using a serialization library like Json.NET, it is easy to convert our data to JSON.

Using the Table Storage Service also requires us to provide a PartitionKey for each document; in a typical multi-tenant database we could use this to partition the data per tenant.

A document in a document store is easy to retrieve when we know its key and can fetch it directly, but sometimes we don't have that information. We might also want to query documents using a filter expression. In a relational database filtering in a query is easy, but in a document store it requires a bit more effort. RavenDB lets us define indexes we can use to filter documents in queries. With the Table Storage Service we can use additional properties to store the information we want to filter on, acting like indexes.

In order to create a very lightweight Document Store with the Table Storage Service I will use my ElasticTableEntity class from a previous post.

First let me show you my domain entities for this demo: a simple Project class which may have many Tasks associated with it.

public class Project
{
    public Guid Owner { get; set; }
    public Guid Id { get; set; }
    public string Name { get; set; }
    public DateTime StartDate { get; set; }
    public int Status { get; set; }
    public List<Task> Tasks { get; set; }
}
 
public class Task
{
    public string Name { get; set; }
    public bool IsCompleted { get; set; }
}

Now let's take a look at a typical Repository implementation for the Projects. You will need both the WindowsAzure.Storage and Newtonsoft.Json packages from NuGet for this part.

public class ProjectRepository
{
    private CloudTable table;
 
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var client = storageAccount.CreateCloudTableClient();
        this.table = client.GetTableReference("Project");
        this.table.CreateIfNotExists();
    }
 
    public void Insert(Project project)
    {
        project.Id = Guid.NewGuid();
        
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        
        entity.Document = JsonConvert.SerializeObject(project, 
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Insert(entity));
    }
 
    public IEnumerable<Project> List(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }
 
    public IEnumerable<Project> ListWithTasks(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                    QueryComparisons.GreaterThan, 0)));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }

    public Project Load(string partitionKey, string rowKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("RowKey", 
                    QueryComparisons.Equal, rowKey)));
        
        dynamic entity = table.ExecuteQuery(query).SingleOrDefault();
        if (entity != null)
        {
            var document = (string)entity.Document.StringValue;
            return JsonConvert.DeserializeObject<Project>(document);
        }
        
        return null;
    }
 
    public void Update(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        entity.Document = JsonConvert.SerializeObject(project,
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Replace(entity));
    }
 
    public void Delete(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
 
    public void Delete(string partitionKey, string rowKey)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = partitionKey;
        entity.RowKey = rowKey;
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
}

We could refactor the code to reduce duplication, but the point was to show you how to dynamically create a Document property to store the actual serialized document and how to handle the basic CRUD operations. You can also see how to dynamically add other properties to store extra information about the document. This is useful for the ListWithTasks method, which fetches only Projects that have at least one Task.

Finally, let's take a look at a few examples of how to use the ProjectRepository itself (in LINQPad in this case)...

private void Insert()
{
    var repo = new ProjectRepository();
    
    var project = new Project()
    {
        Owner = Guid.Parse("8ad82668-4b08-49c9-87ef-80870bfb4b85"),
        Name = "My new project",
        StartDate = DateTime.Now,
        Status = 4,
        Tasks = new List<Task>()
        { 
            new Task { Name = "Task 1", IsCompleted = true }, 
            new Task { Name = "Task 2" } 
        }
    };
    
    repo.Insert(project);
}
 
private void List()
{
    var repo = new ProjectRepository();
    
    var projects = repo.List("8ad82668-4b08-49c9-87ef-80870bfb4b85");
    projects.Dump();
}
 
private void Load()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");

    project.Dump();
}
 
private void Update()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    
    project.Name = "Modified name " + DateTime.Now.Ticks;
    repo.Update(project);
}
 
private void Delete()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    repo.Delete(project);
}
 
private void DeleteDirectly()
{
    var repo = new ProjectRepository();
    repo.Delete("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
}

What I've shown you here is really basic and comes with some limitations: a serialized document can't be bigger than 64 KB (the size limit of a single string property), and we are limited to about 251 extra properties (or indexes) per entity. Still, it is a good start for a prototype of a document store.
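To go past the 64 KB per-property limit, the serialized JSON could be split across several properties (Document1, Document2, ...) and reassembled when loading. Here is a minimal sketch of that idea using a hypothetical DocumentChunker helper (it is not part of the repository above); note the chunk size is 32K characters, since table string properties hold 64 KB of UTF-16 data:

```csharp
using System;
using System.Collections.Generic;

public static class DocumentChunker
{
    // A string property holds at most 64 KB of UTF-16 data, i.e. 32K characters.
    private const int MaxChars = 32 * 1024;

    // Split the serialized JSON into property-sized chunks
    // (to be stored as Document1, Document2, ...).
    public static List<string> Split(string json)
    {
        var chunks = new List<string>();
        for (int i = 0; i < json.Length; i += MaxChars)
            chunks.Add(json.Substring(i, Math.Min(MaxChars, json.Length - i)));
        return chunks;
    }

    // Reassemble the chunks back into the original JSON.
    public static string Join(IEnumerable<string> chunks)
    {
        return string.Concat(chunks);
    }
}
```

The deserialization side would then concatenate every DocumentN property in order before calling JsonConvert.DeserializeObject.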

I'm currently working on a more self-contained library to help me use the Azure Table Storage Service as a Document Store. More on this in a future post.

If you want you can grab all the code (Gist) on GitHub here and here.

See also

- Using Azure Table Storage with dynamic table entities
- Using Azure Blob Storage to store documents

6 comments:

Anonymous said...

Great post, thank you. Can you explain why the max size is 64 KB but in the next post, you say the max size is 1 MB?

Thanks.

Unknown said...

64KB is the size limit for one property. 1MB is for all properties of one table entity.

To support documents larger than 64KB I would need to split the JSON data into multiples properties (Document2, Document3, ...) and handle that in the deserialization part too.

Anonymous said...

Ah, I see. Thanks for the response. Keep up the great work.
