Wednesday, August 28, 2013

Windows Azure Caching and transient faults

When using remote services over the wire we should always plan for transient failures. Windows Azure Caching, like any service in the Azure world, is prone to such problems. Out of the box, calls to the cache server will fail from time to time due to network issues. Typically you will get exceptions like these:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later.
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server.
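
The error codes in these messages can help decide whether a failure is worth retrying. Here is a minimal sketch of that idea. The CacheErrorClassifier class and its method names are hypothetical, and treating exactly these three codes as transient is my assumption, based only on the messages above. It parses the message text to stay self-contained; if the exception type exposes the code as a property, comparing that would be less brittle.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the caching library.
public static class CacheErrorClassifier
{
    // Codes copied from the three messages above (assumed to be transient).
    private static readonly HashSet<string> TransientCodes = new HashSet<string>
    {
        "ERRCA0016", // connection terminated
        "ERRCA0017", // temporary failure, retry later
        "ERRCA0018"  // request timed out
    };

    private static readonly Regex ErrorCodePattern =
        new Regex(@"ErrorCode<(?<code>ERRCA\d{4})>");

    // Returns the ERRCAxxxx code found in the message, or null if there is none.
    public static string ExtractErrorCode(string message)
    {
        if (message == null)
        {
            return null;
        }

        Match match = ErrorCodePattern.Match(message);
        return match.Success ? match.Groups["code"].Value : null;
    }

    // True when the exception message carries one of the codes listed above.
    public static bool IsTransient(Exception exception)
    {
        string code = exception == null ? null : ExtractErrorCode(exception.Message);
        return code != null && TransientCodes.Contains(code);
    }
}
```

Anything that does not match one of these codes is reported as non-transient, so unknown errors are not retried by accident.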

For that reason it is a best practice to implement some kind of retry logic around the code that calls the cache server. We could have used the Transient Fault Handling Application Block to manage that. But a few months ago I read somewhere that from time to time the DataCache object loses its internal connection to the cache server. A simple way to fix this is to re-create the DataCache instance and retry the operation.

In the implementation below I'm keeping a reference to the DataCacheFactory and DataCache objects (another best practice). The CreateDataCache factory method will come in handy later.

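using System;
using Microsoft.ApplicationServer.Caching;

public class CachingService
{
    // Kept for the lifetime of the service instead of being re-created on every call.
    private DataCacheFactory cacheFactory;
    private DataCache cache;
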
    private DataCache Cache
    {
        get
        {
            if (this.cache == null)
            {
                this.CreateDataCache();
            }

            return this.cache;
        }
    }

    private void CreateDataCache()
    {
        this.cacheFactory = new DataCacheFactory();
        this.cache = this.cacheFactory.GetDefaultCache();
    }

    // ...
}

Then I have this SafeCallFunction method that I use whenever I work with the DataCache object. Notice that the only thing I do before retrying the operation is call the factory method to re-create the DataCache object. If the second attempt also fails, the error is logged and null is returned.

private object SafeCallFunction(Func<object> function)
{
    try
    {
        return function.Invoke();
    }
    catch (DataCacheException)
    {
        // Retry by first re-creating the DataCache
        try
        {
            this.CreateDataCache();
            return function.Invoke();
        }
        catch (DataCacheException)
        {
            // Log error
        }
    }

    return null;
}

Finally, in the rest of the class I can use SafeCallFunction like this:
public object CacheGet(string key)
{
    return this.SafeCallFunction(() => this.Cache.Get(key));
}

public void CachePut(string key, object cacheObject)
{
    this.SafeCallFunction(() => this.Cache.Put(key, cacheObject));
}

public void CacheRemove(string key)
{
    this.SafeCallFunction(() => this.Cache.Remove(key));
}

So far, after a few weeks of using this implementation, the single retry has never failed on us. Before that we had around 5-10 failures a day for about 500k calls to the cache server. I would still recommend using a more robust retry policy with Windows Azure Caching, but I think it's interesting to know that simply instantiating a new DataCache can fix most failures.
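
To give an idea of what a more robust policy could look like, here is a minimal sketch that generalizes the single retry to a bounded number of attempts with exponential backoff. RetryHelper, Execute and all of its parameters are hypothetical names of mine, not part of Windows Azure Caching or the Transient Fault Handling Application Block, and the default values are arbitrary.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of any Azure library.
public static class RetryHelper
{
    public static T Execute<T>(
        Func<T> operation,
        Func<Exception, bool> isTransient,
        Action onRetry,
        int maxAttempts = 3,
        int baseDelayMs = 100)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException("maxAttempts");
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                // Give up on non-transient errors or once the attempts are used up.
                if (attempt >= maxAttempts || !isTransient(ex))
                {
                    throw;
                }
            }

            // Exponential backoff: baseDelayMs, then 2x, 4x, ...
            Thread.Sleep(baseDelayMs * (1 << (attempt - 1)));

            // Hook for resetting state before the next attempt,
            // for example re-creating the DataCache.
            onRetry();
        }
    }
}

// Possible use inside CachingService (assumes every DataCacheException is transient):
//
//     return RetryHelper.Execute(
//         () => this.Cache.Get(key),
//         ex => ex is DataCacheException,
//         this.CreateDataCache);
```

Unlike SafeCallFunction, this sketch rethrows the last exception instead of returning null, so the caller decides whether a cache failure should be logged and ignored or treated as an error.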

References

Caching in Windows Azure
Best Practices for using Windows Azure Cache
Optimization Guidance for Windows Azure Caching
The Transient Fault Handling Application Block