Thursday, May 9, 2013

Using Azure Blob Storage to store documents

In my last post, Implementing a document oriented database with the Windows Azure Table Storage Service, I used the Table Storage Service to store serialized documents in an entity's property. While that is an easy way to store complex objects, table storage is really meant to store primitives like int, bool, dates and simple strings. There is another service in the Windows Azure Storage family that is better suited to storing documents: the Blob Storage Service. Blob storage uses the metaphor of files, which are in essence documents.

Now I will adapt the repository I wrote in my previous post to use blob storage. I'll only walk through the changes I'm making here.

Constructor


First, we need to create a CloudBlobContainer instance in the constructor. Note that blob container names must be lowercase.

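using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Storage.Table;
using Newtonsoft.Json;

public class ProjectRepository
{
    private CloudTable table;
    private CloudBlobContainer container;
    
    public ProjectRepository()
    {
        var connectionString = "...";
        
        CloudStorageAccount storageAccount = 
            CloudStorageAccount.Parse(connectionString);

        // The table still holds the queryable properties of each project.
        var tableClient = storageAccount.CreateCloudTableClient();
        this.table = tableClient.GetTableReference("Project");
        this.table.CreateIfNotExists();
        
        // The container holds the serialized documents (name must be lowercase).
        var blobClient = storageAccount.CreateCloudBlobClient();
        this.container = blobClient.GetContainerReference("project");
        this.container.CreateIfNotExists();
    }
    // ...
}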

Insert


Next, in the Insert method, we no longer store the document in a property of the ElasticTableEntity object. Instead we serialize the document to JSON, upload it as a file to blob storage and set the ContentType of that file to application/json. For the blob name (or path) the pattern I'm using looks like this: {document-type}/{partition-key}/{row-key}.json.

public void Insert(Project project)
{
    project.Id = Guid.NewGuid();
        
    var document = JsonConvert.SerializeObject(project,
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
  
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
  
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks != null ? project.Tasks.Count() : 0;
  
    this.table.Execute(TableOperation.Insert(entity));
}

private void UploadDocument(string partitionKey, string rowKey, string document)
{
    // Use "/" in the blob name: it is the delimiter blob storage uses
    // for virtual directories.
    var filename = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(filename);
  
    // Set the content type before uploading so it is sent with the upload
    // instead of requiring a separate SetProperties call.
    blockBlob.Properties.ContentType = "application/json";
  
    using (var memory = new MemoryStream())
    using (var writer = new StreamWriter(memory))
    {
        writer.Write(document);
        writer.Flush();
        memory.Seek(0, SeekOrigin.Begin);
   
        blockBlob.UploadFromStream(memory);
    }
}
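
Since the same naming pattern is rebuilt in several methods, it could be centralized in one place. The following is only a sketch: the BlobNames class and its ForDocument method are hypothetical names that are not part of the repository above, and the argument checks are an assumption about what counts as a valid key.

```csharp
using System;

// Hypothetical helper (not part of the original repository): builds the
// blob name in one place so every method uses the same separator.
public static class BlobNames
{
    public static string ForDocument(string documentType, string partitionKey, string rowKey)
    {
        if (string.IsNullOrEmpty(documentType))
            throw new ArgumentException("A document type is required.", "documentType");
        if (string.IsNullOrEmpty(partitionKey))
            throw new ArgumentException("A partition key is required.", "partitionKey");
        if (string.IsNullOrEmpty(rowKey))
            throw new ArgumentException("A row key is required.", "rowKey");

        // {document-type}/{partition-key}/{row-key}.json
        return string.Format("{0}/{1}/{2}.json", documentType, partitionKey, rowKey);
    }
}
```

With this in place, UploadDocument, Load and DeleteDocument could all call BlobNames.ForDocument("project", partitionKey, rowKey) instead of formatting the string themselves.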

Load


For the Load method we build the blob name from the PartitionKey and RowKey, then download the document from blob storage. In DownloadDocument I'm using a MemoryStream and a StreamReader to get the serialized document as a string.

public Project Load(string partitionKey, string rowKey)
{
    var blobName = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var document = this.DownloadDocument(blobName);
    return JsonConvert.DeserializeObject<Project>(document);
}

private string DownloadDocument(string blobName)
{
    var blockBlob = this.container.GetBlockBlobReference(blobName);
  
    using (var memory = new MemoryStream())
    using (var reader = new StreamReader(memory))
    {
        blockBlob.DownloadToStream(memory);
        memory.Seek(0, SeekOrigin.Begin);
   
        return reader.ReadToEnd();
    }
}
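
To see what the MemoryStream plumbing in UploadDocument and DownloadDocument does without making any call to Azure, here is a small sketch of the same round trip. The DocumentStreams class and its method names are made up for illustration, and it pins the encoding to UTF-8 without a byte order mark, which is an explicit choice here rather than something the repository code states.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper: the string <-> stream conversion used around
// UploadFromStream and DownloadToStream, with no storage call involved.
public static class DocumentStreams
{
    // Writes the text into an in-memory stream and rewinds it so it
    // can be read (or uploaded) from the beginning.
    public static MemoryStream ToStream(string document)
    {
        var memory = new MemoryStream();

        // leaveOpen: true keeps the MemoryStream usable after the writer
        // is disposed; disposing the writer flushes its buffer.
        using (var writer = new StreamWriter(memory, new UTF8Encoding(false), 1024, true))
        {
            writer.Write(document);
        }

        memory.Seek(0, SeekOrigin.Begin);
        return memory;
    }

    // Reads the whole stream back as text; disposes the stream when done.
    public static string FromStream(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Forgetting to rewind the stream after writing is the usual mistake with this pattern: the upload then starts at the end of the stream and stores an empty blob.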

List


In the first List method we want to get all documents of the same partition. We can do that by directly using the ListBlobs method of CloudBlobDirectory. For the ListWithTasks method we still need to query the table storage first to know which documents contain at least one task. Then with the entities we'll know the RowKey value of those documents so we can simply call the Load method we just saw.

public IEnumerable<Project> List(string partitionKey)
{
    var listItems = this.container
        .GetDirectoryReference("project/" + partitionKey).ListBlobs();
  
    return listItems.OfType<CloudBlockBlob>()
        .Select(x => this.DownloadDocument(x.Name))
        .Select(document => JsonConvert.DeserializeObject<Project>(document));
}
    
public IEnumerable<Project> ListWithTasks(string partitionKey)
{
    var query = new TableQuery<ElasticTableEntity>()
        .Select(new [] { "RowKey" })
        .Where(TableQuery.CombineFilters(
            TableQuery.GenerateFilterCondition("PartitionKey", 
                QueryComparisons.Equal, partitionKey),
            TableOperators.And,
            TableQuery.GenerateFilterConditionForInt("TotalTasks", 
                QueryComparisons.GreaterThan, 0)));
       
    dynamic entities = table.ExecuteQuery(query).ToList();
 
    foreach (var entity in entities)
        yield return this.Load(partitionKey, entity.RowKey);
}
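
One thing to keep in mind with List is that it returns a lazy LINQ sequence, so the downloads only happen when the caller enumerates the result, and they happen again on every new enumeration. The sketch below shows that behaviour with a stand-in for DownloadDocument; FakeBlobStore is a hypothetical class that makes no storage calls.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the repository: counts simulated downloads
// instead of calling blob storage.
public sealed class FakeBlobStore
{
    public int Downloads { get; private set; }

    // Plays the role of DownloadDocument.
    public string Download(string blobName)
    {
        Downloads++;
        return "{ \"blob\": \"" + blobName + "\" }";
    }

    // Same shape as List: a deferred Select over the blob names.
    public IEnumerable<string> List(IEnumerable<string> blobNames)
    {
        return blobNames.Select(name => Download(name));
    }
}
```

Calling List on two blob names leaves Downloads at 0 until the result is enumerated, and enumerating it twice (for example with two ToList calls) brings Downloads to 4. A caller that needs the documents more than once should materialize them once with ToList and reuse that list.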

Update


To update a document now we also need to serialize and upload the new version to blob storage.

public void Update(Project project)
{
    var document = JsonConvert.SerializeObject(project, 
        Newtonsoft.Json.Formatting.Indented);

    var partitionKey = project.Owner.ToString();
    var rowKey = project.Id.ToString();
 
    UploadDocument(partitionKey, rowKey, document);
        
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    entity.Name = project.Name;
    entity.StartDate = project.StartDate;
    entity.TotalTasks = project.Tasks != null ? project.Tasks.Count() : 0;
        
    this.table.Execute(TableOperation.Replace(entity));
}

Delete


Finally, deleting a document now also requires us to call Delete on the CloudBlockBlob reference, in addition to deleting the table entity.

public void Delete(Project project)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = project.Owner.ToString();
    entity.RowKey = project.Id.ToString();
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
    
    this.DeleteDocument(entity.PartitionKey, entity.RowKey);
}
 
public void Delete(string partitionKey, string rowKey)
{
    dynamic entity = new ElasticTableEntity();
    entity.PartitionKey = partitionKey;
    entity.RowKey = rowKey;
    entity.ETag = "*";
        
    this.table.Execute(TableOperation.Delete(entity));
 
    this.DeleteDocument(partitionKey, rowKey);
}

private void DeleteDocument(string partitionKey, string rowKey)
{
    var blobName = string.Format("project/{0}/{1}.json", partitionKey, rowKey);
    var blockBlob = this.container.GetBlockBlobReference(blobName);
    blockBlob.Delete(DeleteSnapshotsOption.IncludeSnapshots);
}

Conclusion


By using both the Table and Blob Storage Services we get the best of both worlds: we can query on a document's properties with table storage, and we can store documents larger than 64KB in blob storage. Of course, almost every operation on my repository now requires two calls to Azure. Currently those are done sequentially, waiting for the first call to complete before making the second. I should fix that by using the asynchronous variants of the storage service methods, like the BeginDelete/EndDelete method pair on CloudBlockBlob.
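
As a rough idea of what running the two calls concurrently could look like, here is a sketch using tasks. ParallelStorageCalls and RunBothAsync are hypothetical names, and the two delegates are placeholders for the table and blob operations; wrapping a Begin/End pair into a Task (for example with Task.Factory.FromAsync) is left out and is an assumption about how the real calls would be adapted.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: starts two independent storage operations and
// waits for both, instead of running them one after the other.
public static class ParallelStorageCalls
{
    public static async Task RunBothAsync(Func<Task> tableOperation, Func<Task> blobOperation)
    {
        // Start both operations before awaiting either of them.
        var tableTask = tableOperation();
        var blobTask = blobOperation();

        // Completes when both are done; if either one fails, awaiting
        // the combined task rethrows an exception.
        await Task.WhenAll(tableTask, blobTask);
    }
}
```

The trade-off is that the two stores are not updated atomically. If one call fails while the other succeeds, the table and the container disagree (an entity without a blob, or the reverse), so the caller still needs a way to detect and clean up that case.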

I hope this post gives you ideas for new and clever ways to use the Windows Azure Storage Services in your projects.

See also

- Using Azure Table Storage with dynamic table entities
- Document oriented database with Azure Table Storage Service

2 comments:

Anonymous said...

Can this blob storage + table storage combination on Azure be used as a replacement for MongoDB? I observed that the MongoDB packages on Azure (and even AWS) are quite expensive in comparison to blob storage. Is it logically possible to convert all unstructured data to a table storage format, thereby completely getting rid of any Mongo or NoSQL database on my website? How is the performance in comparison to a dedicated MongoDB? Any answers?

Pascal Laurin said...

Hi Sasijanth, honestly I wouldn't use this code on any serious project. It is a nice proof of concept and shows how to implement a "poor man's" document database. Theoretically, using this with Table Storage to maintain a secondary index, you could scale to thousands of documents, but it will never match the likes of MongoDB, RavenDB or Azure's DocumentDB in features, reliability and performance. I'm pretty sure this implementation is slower than what the others can offer.