The Never Ending Journey: Document database

Showing posts with label Document database. Show all posts

Thursday, May 9, 2013

Using Azure Blob Storage to store documents

Last time I wrote about Implementing a document oriented database with the Windows Azure Table Storage Service I was using the Table Storage Service to store serialized documents into an entity's property. While it is an easy way to store complex objects the table storage is usually meant to storage primitives like int, bool, date and simple strings. However there is another service in the Windows Azure Storage family who is better suited to store documents: the Blob Storage Service. The blob storage use the metaphor of files which is in essence documents.

Now I will adapt the Repository I did in my previous post to use the blob storage this time. I'll only walk through the changes I'm making here.

Constructor

First, we need to create a CloudBlobContainer instance in the constructor. Please note that for blob storage container names are required to be lower-case.

public class ProjectRepository
{
    private CloudTable table;
    private CloudBlobContainer container;
    
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var tableClient = storageAccount.CreateCloudTableClient();
        this.table = tableClient.GetTableReference("Project");
        this.table.CreateIfNotExists();
        
        var blobClient = storageAccount.CreateCloudBlobClient();
        this.container = blobClient.GetContainerReference("project");
        this.container.CreateIfNotExists();
    }
    // ...
}

Insert

Next for the Insert method, we no longer store the document in a property of the ElasticTableEntity object. Instead we want to serialize the document into the JSON format and upload it as a file to the blob storage and set the ContentType of that file to application/json. For the blob name (or path) the pattern I'm using looks like this: {document-type}/{partition-key}/{row-key}.

public void Insert(Project project)
{
    project.Id = Guid.NewGuid();
        
    var document = JsonConvert.SerializeObject(project,
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
  
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
  
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks.Count();
  
    this.table.Execute(TableOperation.Insert(entity));
}

private void UploadDocument(string partitionKey, string rowKey, string document)
{
    var filename = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(filename);
  
    using (var memory = new MemoryStream())
    using (var writer = new StreamWriter(memory))
    {
        writer.Write(document);
        writer.Flush();
        memory.Seek(0, SeekOrigin.Begin);
   
        blockBlob.UploadFromStream(memory);
    }
  
    blockBlob.Properties.ContentType = "application/json";
    blockBlob.SetProperties();
}

Load

For the Load method we can get the blob name using the PartitionKey and RowKey then download the document from blob storage. In DownloadDocument I'm using a MemoryStream and StreamReader to get the serialized document as a string.

public Project Load(string partitionKey, string rowKey)
{
    var blobName = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var document = this.DownloadDocument(blobName);
    return JsonConvert.DeserializeObject<Project>(document);
}

private string DownloadDocument(string blobName)
{
    var blockBlob = this.container.GetBlockBlobReference(blobName);
  
    using (var memory = new MemoryStream())
    using (var reader = new StreamReader(memory))
    {
        blockBlob.DownloadToStream(memory);
        memory.Seek(0, SeekOrigin.Begin);
   
        return reader.ReadToEnd();
    }
}

List

In the first List method we want to get all documents of the same partition. We can do that by directly using the ListBlobs method of CloudBlobDirectory. For the ListWithTasks method we still need to query the table storage first to know which documents contain at least one task. Then with the entities we'll know the RowKey value of those documents so we can simply call the Load method we just saw.

public IEnumerable<Project> List(string partitionKey)
{
    var listItems = this.container
        .GetDirectoryReference("project/" + partitionKey).ListBlobs();
  
    return listItems.OfType<CloudBlockBlob>()
        .Select(x => this.DownloadDocument(x.Name))
        .Select(document => JsonConvert.DeserializeObject<Project>(document));
}
    
public IEnumerable<Project> ListWithTasks(string partitionKey)
{
    var query = new TableQuery<ElasticTableEntity>()
        .Select(new [] { "RowKey" })
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                QueryComparisons.GreaterThan, 0)));
       
    dynamic entities = table.ExecuteQuery(query).ToList();
 
    foreach (var entity in entities)
        yield return this.Load(partitionKey, entity.RowKey);
}

Update

To update a document now we also need to serialize and upload the new version to blob storage.

public void Update(Project project)
{
    var document = JsonConvert.SerializeObject(project, 
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
        
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks != null ? project.Tasks.Count() : 0;
        
    this.table.Execute(TableOperation.Replace(entity));
}

Delete

Finally, deleting a document now requires us to call Delete on the CloudBlobContainer reference.

public void Delete(Project project)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = project.Owner.ToString();
    entity.RowKey = project.Id.ToString();
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
    
    this.DeleteDocument(entity.PartitionKey, entity.RowKey);
}
 
public void Delete(string partitionKey, string rowKey)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
 
    this.DeleteDocument(partitionKey, rowKey);
}

private void DeleteDocument(string partitionKey, string rowKey)
{
    var blobName = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(blobName);
    blockBlob.Delete(DeleteSnapshotsOption.IncludeSnapshots);
}

Conclusion

Using both Tables and Blobs Storage Services we can get the best of both worlds. We can query for document's properties with table storage and we can store documents larger than 64KB in blob storage. Of course now almost all operations on my Repository requires two calls to Azure. Currently those are done sequentially, waiting for the first call to complete before the doing the second call. I should fix that by using the asynchronous variants of storage service methods like the BeginDelete/EndDelete method pair on CloudBlobContainer.

I hope this post is giving you ideas on new and clever ways you can use the Windows Azure Storage Services in your projects.

Tuesday, April 9, 2013

Document oriented database with Azure Table Storage Service

What is a Document database?

A document store or document oriented database like RavenDB is a kind of NoSQL database where we store semi structured data in documents and use a key to retrieve existing documents.

The difference with a relational database is that a complex data structure needs to be represented by entities and relations in a RDMS which means using many tables, joints and constraints. In a document database the whole graph of entities is stored as a single document. Of course we still need to handle some relations between documents or graphs of entities. This is done by storing other documents key inside the document and acting like a foreign keys.

Another way to see documents is to think that all tables related together with a delete cascade constraints on the relations are part of the same document. If a piece of data from a table can only exists if related data from another table also exists it means both should be part of the same document.

The concept of document relates well with Aggregates of Domain Driven Design.

Implementing a document database with the Azure Table Storage Service

Looking at RavenDB I was wondering if it was possible to use similar patterns but with the Windows Azure Table Storage Service instead of the file system as RavenDB is using.

A good storage format for our documents is the JSON representation. Using a serialization library like Json.Net it will be easy to convert our data in JSON.

Using the Table Storage Service also requires us to provide a PartionKey for our document, in a typical multi-tenant database we could use this to 'partition' the data per tenant.

A document in a document store is easy to retrieve when we know the key to fetch it directly, but sometime we don't have that information. We might also want to query documents using a filter expression. In a relational database filtering in a query is easy but in a document store it requires a bit more efforts. RavenDB let us define indexes we could use to filter documents in queries. With the Table Store Service we can use additional properties to store information we want to filter on, acting like the indexes.

In order to create a very light weight Document Store with the Table Storage Service I will use my ElasticTableEntity class from a previous post.

First let me show you my domain entities for this demo. A simple Project class which may have many Tasks associated to it.

public class Project
{
    public Guid Owner { get; set; }
    public Guid Id { get; set; }
    public string Name { get; set; }
    public DateTime StartDate { get; set; }
    public int Status { get; set; }
    public List<Task> Tasks { get; set; }
}
 
public class Task
{
    public string Name { get; set; }
    public bool IsCompleted { get; set; }
}

Now let's take a look a typical Repository implementation for the Projects. You will need both WindowsAzure.Storage and Newtonsoft.Json packages from NuGet for this part.

public class ProjectRepository
{
    private CloudTable table;
 
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var client = storageAccount.CreateCloudTableClient();
        this.table = client.GetTableReference("Project");
        this.table.CreateIfNotExists();
    }
 
    public void Insert(Project project)
    {
        project.Id = Guid.NewGuid();
        
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        
        entity.Document = JsonConvert.SerializeObject(project, 
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Insert(entity));
    }
 
    public IEnumerable<Project> List(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }
 
    public IEnumerable<Project> ListWithTasks(string partitionKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                    QueryComparisons.GreaterThan, 0)));
        
        dynamic entities = table.ExecuteQuery(query).ToList();
        foreach (var entity in entities)
        {
            var document = (string)entity.Document.StringValue;
            yield return JsonConvert.DeserializeObject<Project>(document);
        }
    }

    public Project Load(string partitionKey, string rowKey)
    {
        var query = new TableQuery<ElasticTableEntity>()
            .Select(new [] { "Document" })
            .Where(TableQuery.CombineFilters(
                TableQuery.GenerateFilterCondition("PartitionKey", 
                    QueryComparisons.Equal, partitionKey),
                TableOperators.And,
                TableQuery.GenerateFilterCondition("RowKey", 
                    QueryComparisons.Equal, rowKey)));
        
        dynamic entity = table.ExecuteQuery(query).SingleOrDefault();
        if (entity != null)
        {
            var document = (string)entity.Document.StringValue;
            return JsonConvert.DeserializeObject<Project>(document);
        }
        
        return null;
    }
 
    public void Update(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        entity.Document = JsonConvert.SerializeObject(project,
            Newtonsoft.Json.Formatting.Indented);
        
        // Additional fields for querying (indexes)
        entity.Name = project.Name;
        entity.StartDate = project.StartDate;
        entity.TotalTasks = project.Tasks.Count();
        
        this.table.Execute(TableOperation.Replace(entity));
    }
 
    public void Delete(Project project)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = project.Owner.ToString();
        entity.RowKey = project.Id.ToString();
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
 
    public void Delete(string partitionKey, string rowKey)
    {
        dynamic entity = new ElasticTableEntity();
        entity.PartitionKey = partitionKey;
        entity.RowKey = rowKey;
        entity.ETag = "*";
        
        this.table.Execute(TableOperation.Delete(entity));
    }
}

We could refactor the code to reduce duplication but the point was to show you how to dynamically create a Document property to store the actual serialized document and how to handle the basic CRUD operations. You can also see how to dynamically add other properties to store extra information on the document. This is useful for the LoadWithTasks method which fetch only Projects with at least one Task on it.

Finally let's take a look at a few examples on how to use the ProjectRepository itself (in LinqPad in this case)...

private void Insert()
{
    var repo = new ProjectRepository();
    
    var project = new Project()
    {
        Owner = Guid.Parse("8ad82668-4b08-49c9-87ef-80870bfb4b85");
        Name = "My new project",
        StartDate = DateTime.Now,
        Status = 4,
        Tasks = new List<Task>()
        { 
            new Task { Name = "Task 1", IsCompleted = true }, 
            new Task { Name = "Task 2" } 
        }
    };
    
    repo.Insert(project);
}
 
private void List()
{
    var repo = new ProjectRepository();
    
    var projects = repo.List("static");
    projects.Dump();
}
 
private void Load()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");

    project.Dump();
}
 
private void Update()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    
    project.Name = "Modified name " + DateTime.Now.Ticks;
    repo.Update(project);
}
 
private void Delete()
{
    var repo = new ProjectRepository();
    var project = repo.Load("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
    repo.Delete(project);
}
 
private void DeleteDirectly()
{
    var repo = new ProjectRepository();
    repo.Delete("8ad82668-4b08-49c9-87ef-80870bfb4b85", "c7d5f59c-72da-48de-83ca-265d8609ec02");
}

What I've shown you here is really basic and we do have some limitations like the fact that a serialized document can't be bigger than 64KB in size and we are limited to 251 extra properties (or indexes). Still, it is a good start for a prototype of a document store.

I'm currently working on a more self-contained library to help me use the Azure Table Storage Service as a Document Store. More on this in a future post.

If you want you can grab all the code (Gist) on GitHub here and here.

The Never Ending Journey