The Never Ending Journey: Using Azure Blob Storage to store documents

Last time I wrote about Implementing a document oriented database with the Windows Azure Table Storage Service I was using the Table Storage Service to store serialized documents into an entity's property. While it is an easy way to store complex objects the table storage is usually meant to storage primitives like int, bool, date and simple strings. However there is another service in the Windows Azure Storage family who is better suited to store documents: the Blob Storage Service. The blob storage use the metaphor of files which is in essence documents.

Now I will adapt the Repository I did in my previous post to use the blob storage this time. I'll only walk through the changes I'm making here.

Constructor

First, we need to create a CloudBlobContainer instance in the constructor. Please note that for blob storage container names are required to be lower-case.

public class ProjectRepository
{
    private CloudTable table;
    private CloudBlobContainer container;
    
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        var tableClient = storageAccount.CreateCloudTableClient();
        this.table = tableClient.GetTableReference("Project");
        this.table.CreateIfNotExists();
        
        var blobClient = storageAccount.CreateCloudBlobClient();
        this.container = blobClient.GetContainerReference("project");
        this.container.CreateIfNotExists();
    }
    // ...
}

Insert

Next for the Insert method, we no longer store the document in a property of the ElasticTableEntity object. Instead we want to serialize the document into the JSON format and upload it as a file to the blob storage and set the ContentType of that file to application/json. For the blob name (or path) the pattern I'm using looks like this: {document-type}/{partition-key}/{row-key}.

public void Insert(Project project)
{
    project.Id = Guid.NewGuid();
        
    var document = JsonConvert.SerializeObject(project,
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
  
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
  
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks.Count();
  
    this.table.Execute(TableOperation.Insert(entity));
}

private void UploadDocument(string partitionKey, string rowKey, string document)
{
    var filename = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(filename);
  
    using (var memory = new MemoryStream())
    using (var writer = new StreamWriter(memory))
    {
        writer.Write(document);
        writer.Flush();
        memory.Seek(0, SeekOrigin.Begin);
   
        blockBlob.UploadFromStream(memory);
    }
  
    blockBlob.Properties.ContentType = "application/json";
    blockBlob.SetProperties();
}

Load

For the Load method we can get the blob name using the PartitionKey and RowKey then download the document from blob storage. In DownloadDocument I'm using a MemoryStream and StreamReader to get the serialized document as a string.

public Project Load(string partitionKey, string rowKey)
{
    var blobName = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var document = this.DownloadDocument(blobName);
    return JsonConvert.DeserializeObject<Project>(document);
}

private string DownloadDocument(string blobName)
{
    var blockBlob = this.container.GetBlockBlobReference(blobName);
  
    using (var memory = new MemoryStream())
    using (var reader = new StreamReader(memory))
    {
        blockBlob.DownloadToStream(memory);
        memory.Seek(0, SeekOrigin.Begin);
   
        return reader.ReadToEnd();
    }
}

List

In the first List method we want to get all documents of the same partition. We can do that by directly using the ListBlobs method of CloudBlobDirectory. For the ListWithTasks method we still need to query the table storage first to know which documents contain at least one task. Then with the entities we'll know the RowKey value of those documents so we can simply call the Load method we just saw.

public IEnumerable<Project> List(string partitionKey)
{
    var listItems = this.container
        .GetDirectoryReference("project/" + partitionKey).ListBlobs();
  
    return listItems.OfType<CloudBlockBlob>()
        .Select(x => this.DownloadDocument(x.Name))
        .Select(document => JsonConvert.DeserializeObject<Project>(document));
}
    
public IEnumerable<Project> ListWithTasks(string partitionKey)
{
    var query = new TableQuery<ElasticTableEntity>()
        .Select(new [] { "RowKey" })
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                QueryComparisons.GreaterThan, 0)));
       
    dynamic entities = table.ExecuteQuery(query).ToList();
 
    foreach (var entity in entities)
        yield return this.Load(partitionKey, entity.RowKey);
}

Update

To update a document now we also need to serialize and upload the new version to blob storage.

public void Update(Project project)
{
    var document = JsonConvert.SerializeObject(project, 
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
        
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks != null ? project.Tasks.Count() : 0;
        
    this.table.Execute(TableOperation.Replace(entity));
}

Delete

Finally, deleting a document now requires us to call Delete on the CloudBlobContainer reference.

public void Delete(Project project)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = project.Owner.ToString();
    entity.RowKey = project.Id.ToString();
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
    
    this.DeleteDocument(entity.PartitionKey, entity.RowKey);
}
 
public void Delete(string partitionKey, string rowKey)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
 
    this.DeleteDocument(partitionKey, rowKey);
}

private void DeleteDocument(string partitionKey, string rowKey)
{
    var blobName = string.Format(@"project\{0}\{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(blobName);
    blockBlob.Delete(DeleteSnapshotsOption.IncludeSnapshots);
}

Conclusion

Using both Tables and Blobs Storage Services we can get the best of both worlds. We can query for document's properties with table storage and we can store documents larger than 64KB in blob storage. Of course now almost all operations on my Repository requires two calls to Azure. Currently those are done sequentially, waiting for the first call to complete before the doing the second call. I should fix that by using the asynchronous variants of storage service methods like the BeginDelete/EndDelete method pair on CloudBlobContainer.

I hope this post is giving you ideas on new and clever ways you can use the Windows Azure Storage Services in your projects.

The Never Ending Journey

Thursday, May 9, 2013

Using Azure Blob Storage to store documents