
Wednesday, July 31, 2013

Using Windows Azure Caching efficiently across multiple Cloud Service roles

Windows Azure Caching is a great way to improve the performance of your Azure application at no additional cost. The cache runs alongside your application in Cloud Service roles. The only thing you need to decide is how much of a role's memory you want to use for it (for a co-located cache role). You can also dedicate all the memory of a role to caching if you want (with a dedicated cache role). Roles that host caching are called cache clusters.

Getting started with Azure Caching is so easy that it can be a while before you fully understand the best way to use it. On a recent project my first thought was to enable caching on all the Cloud Service roles as a co-located service. This caused us problems.

First of all, being a developer, I debug my application using the local Azure compute emulator. The emulator runs one cache service for each instance of the roles that host cache clusters. The application has 2 web roles and 1 worker role, so when I start a debugging session with multiple instances per role I need a lot of memory to run everything. More importantly, cache clusters do not share cached data with each other. This caused us to have stale data in the application.

That is when I figured out that I needed to read a bit more on Azure Caching if I was to use it efficiently.

Understanding cache clusters


When you enable caching on an Azure role, each instance of that role will run a cache service using a portion of the memory (or all of it if it's a dedicated cache role). The cache services running on each instance of a single role are managed as a single cache cluster. Cache services can talk to each other and synchronize data, but only inside the same cache cluster (same role). That is why enabling caching on many roles might not be the best thing to do.


Another thing to mention is that cache clusters can only be created on Small role instances or bigger. The reason is that with an Extra Small instance you only get 768MB of RAM, which is pretty much all used up by anything you run on those instances.

Now, to enable a cache cluster on your role, go to the role property page, on the Caching tab.


Here you will notice that I also enabled notifications, which will allow us to use local caches efficiently later.

For more information on the different configuration options for cache clusters go here.

Configuring roles to use cache clients


Now that we have taken care of the server side of the caching configuration, let's talk about the client side. Each instance of each role inside the same cloud deployment can connect to a cache cluster. If you run only one cluster then you are guaranteed to access the same cached data from whatever role you are in inside your application (as long as you have a valid configuration).


One nice feature we can enable in each role configuration is the local cache client. With this we can cache locally, in the role instance's memory, the data we recently fetched from the cache cluster for even faster access. Remember the Notifications option we enabled on the server side? Using the configuration below in the Web.config or App.config of your role will ensure data stored in the local cache client gets updated whenever the cache server version of that data changes. Basically, the local cache client will invalidate data based on notifications received from the cache cluster.
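Once that configuration is in place, the client code itself stays simple. Here is a minimal sketch, assuming the In-Role Caching client library (Microsoft.ApplicationServer.Caching) and the default cache client configured in Web.config/App.config; the ProductCache class and its loader method are made up for illustration.

using System;
using Microsoft.ApplicationServer.Caching;

public class ProductCache
{
    // DataCacheFactory reads the dataCacheClients section from Web.config/App.config.
    private static readonly DataCacheFactory Factory = new DataCacheFactory();
    private static readonly DataCache Cache = Factory.GetDefaultCache();

    public Product GetProduct(string id)
    {
        // With the local cache enabled, repeated Gets are served from the role
        // instance's memory until a notification invalidates the entry.
        var product = Cache.Get(id) as Product;
        if (product == null)
        {
            product = this.LoadProductFromDatabase(id); // hypothetical loader
            Cache.Put(id, product);
        }

        return product;
    }

    private Product LoadProductFromDatabase(string id)
    {
        // Placeholder for the real data access code.
        return new Product();
    }
}

[Serializable]
public class Product { }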


For more information on client side configuration go here.

Other concerns


This post only gives an overview of the considerations for running multiple cache clusters versus a single one. Azure Caching has a lot more configuration options you should take a look at here. Also really important is how you use Azure Caching in your application.

Conclusions


I've spent a lot of time figuring out how all of this works. I hope this post will help you with your own learning experience.


Monday, July 22, 2013

Handling Azure Storage Queue poison messages

This post will talk about what to do now that we are handling poison messages in our Azure Storage Queues.

First, let's review what we've done so far.


Messages that continuously fail to process will end up in the error queue. Someone asked me why we need error queues at all. We could simply log the errors and delete the message, right? Well, if you have a really efficient and proactive DevOps team, I suppose logging errors along with the original messages ought to be enough.
Someone will review why the message failed, and if it was only a transient error they can send the original message back to the queue, as in the sketch below.
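Here is a minimal sketch of that manual retry, assuming the error queue holds the original message content as implemented in the previous post; the queue names ("task" and "task-error") come from that post and the helper class name is mine.

using Microsoft.WindowsAzure.Storage.Queue;

public static class PoisonMessageRetry
{
    public static void RetryErrorMessages(CloudQueueClient queueClient)
    {
        var taskQueue = queueClient.GetQueueReference("task");
        var errorQueue = queueClient.GetQueueReference("task-error");

        CloudQueueMessage error;
        while ((error = errorQueue.GetMessage()) != null)
        {
            // Re-send the original content as a brand new message so its
            // DequeueCount starts back at zero, then remove it from the error queue.
            taskQueue.AddMessage(new CloudQueueMessage(error.AsString));
            errorQueue.DeleteMessage(error);
        }
    }
}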

We could also store failed messages in an Azure Storage Table.
Then we could simply monitor new entries in this table and act on them. Again, the original message should be stored in the table so we can send it again if we choose.
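A rough sketch of that option follows, assuming the WindowsAzure.Storage table client; the table name and the entity layout are illustrative only.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public class FailedMessageEntity : TableEntity
{
    public FailedMessageEntity() { }

    public FailedMessageEntity(string queueName, string content, string error)
    {
        this.PartitionKey = queueName;
        this.RowKey = DateTime.UtcNow.Ticks.ToString("d19");
        this.Content = content; // keep the original message so it can be re-sent
        this.Error = error;
    }

    public string Content { get; set; }
    public string Error { get; set; }
}

public static class FailedMessageStore
{
    public static void Save(CloudStorageAccount account, string queueName, string content, string error)
    {
        var table = account.CreateCloudTableClient().GetTableReference("failedmessages");
        table.CreateIfNotExists();
        table.Execute(TableOperation.Insert(new FailedMessageEntity(queueName, content, error)));
    }
}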

I think the best reason to use a queue for error messages is if you want to build administrative tools to monitor, review and re-send messages. In this case the queue mechanics let you handle those messages like any other process using queues does.

For me, one of the unpleasant side effects of using error queues is that all the queues in my system are now multiplied by two (one normal and one error queue). It's not too bad if your naming scheme is consistent, but even then, if you do operational work using a tool like Cerebrata Azure Management Studio or even Visual Studio's Server Explorer, you will feel overwhelmed by the quantity of queues.

[Screenshot: managing queue messages with Cerebrata Azure Management Studio]
[Screenshot: managing queue messages with Visual Studio Server Explorer]

Whatever you do, I suggest you always at least log failures properly. Later, while debugging an issue, you will be thankful to easily match the failure logs with the message that caused it.

Friday, June 21, 2013

Windows Azure Storage Queue with error queues

Windows Azure Storage Queues are an excellent way to execute tasks asynchronously in Azure. For example, a web application could leave all the heavy processing to a Worker Role instead of doing it itself. That way requests will complete faster. Queues can also be used to decouple communication between two separate applications or two components of the same application.

One problem with asynchronous processing is what to do when the operation fails. When using synchronous patterns, like a direct call, we usually return an error code or a message, or throw an exception. When we use queues to delegate the execution to another process we can't notify the originator directly. One thing we can do is send a message back to the originator through another queue, like a callback. This is interesting when the process normally produces a result anyway; an error in this case is only another kind of result.
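As a rough sketch of that callback idea (the "task-result" queue name and the message format are assumptions of mine, not part of the implementation below):

using Microsoft.WindowsAzure.Storage.Queue;

public static class TaskCallback
{
    public static void ReportResult(CloudQueueClient queueClient, string taskId, bool succeeded, string details)
    {
        var resultQueue = queueClient.GetQueueReference("task-result");
        resultQueue.CreateIfNotExists();

        // The originator polls this queue and treats a failure as just
        // another kind of result.
        resultQueue.AddMessage(new CloudQueueMessage(
            string.Format("{0}|{1}|{2}", taskId, succeeded ? "OK" : "ERROR", details)));
    }
}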


Another way is to create an error queue (or dead letter queue) for poison messages, where we put all the messages that failed processing. This way we get a list of all the failing messages, so we can review them, find out what the problem was and figure out what to do about it. For example, we can retry a message by moving it back to the main queue so it can be processed again.


Now, let's see how we can implement an error queue using Windows Azure Storage Queues.

Implementation of an error queue


First we will initialize the queues. For each queue we also create an '<queuename>-error' queue.
var storageAccount = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
var queueClient = storageAccount.CreateCloudQueueClient();

this.taskQueueReference = queueClient.GetQueueReference("task");
this.taskErrorQueueReference = queueClient.GetQueueReference("task-error");
 
this.taskQueueReference.CreateIfNotExists();
this.taskErrorQueueReference.CreateIfNotExists();

Next we add a few messages, with one that will cause the processing to fail (to simulate failures).
this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Message " + DateTime.UtcNow.Ticks));

this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Message " + DateTime.UtcNow.Ticks));

this.taskQueueReference.AddMessage(
    new CloudQueueMessage("Error " + DateTime.UtcNow.Ticks));

Finally, here is the code that actually polls the queue for messages. Usually polling is done in an infinite loop, but when no message is fetched it is good practice to wait a while before polling again to prevent unnecessary transaction costs and IO (each call to GetMessages is 1 transaction). Depending on how quickly the queue needs to react to new messages, this delay can go from a few seconds for critical tasks to a few minutes for non-critical tasks. I'm also using a retry mechanism here, meaning that I'll try to process a message a few times before I really consider it in error (poison). If we don't delete a message after fetching it, then after some time it goes back in the queue to be processed again. This means all tasks we want to process using queues should be idempotent.
private void PollQueue()
{
    IEnumerable<CloudQueueMessage> messages;

    do
    {
        messages = this.taskQueueReference
            .GetMessages(8, visibilityTimeout: TimeSpan.FromSeconds(10));

        foreach (var message in messages)
        {
            bool result = false;
            try
            {
                result = this.ProcessMessage(message);

                if (result) this.taskQueueReference.DeleteMessage(message);
            }
            catch (Exception ex)
            {
                this.Log(message.AsString, ex);
            }

            if (!result && message.DequeueCount >= 3)
            {
                this.taskErrorQueueReference.AddMessage(message);
                this.taskQueueReference.DeleteMessage(message);
            }
        }
    } while (messages.Any());
}

private bool ProcessMessage(CloudQueueMessage message)
{
    if (message.AsString.StartsWith("Error")) throw new Exception("Error!");

    return true;
}

First, I'm fetching messages in batches, of 8 in this case. In one transaction you can fetch between 1 and 32 messages. I also set the visibilityTimeout to 10 seconds, which means the messages won't be visible to anyone else during that time. Usually you want to set the visibility timeout based on how much time is required to process all the messages of the batch. If we don't have time to delete the messages from the queue before the timeout elapses, another worker could fetch a message and start processing it again. So we should balance the time to process all the messages in one batch against how much time we want to allow between retries.

Next we process the message. If the processing is successful, we simply return true so the message can be deleted from the queue. If processing failed, we have two options: return false or throw an exception. I simply return false instead of throwing an exception most of the time when the failure is expected.

Finally, we check how many times we have unsuccessfully tried to process the message and whether we have reached our limit (in this case 3 times). If we have, then it's time to send that message to the error queue and delete it from the normal queue.

Next time we will look at how to handle the messages in the error queue. You can find that post here.

Thursday, April 7, 2011

Single Responsibility Principle

    Working in god classes with monster methods is no fun. It usually involves guessing where exactly I should put the new code. Then it takes a lot of debugging to find out why the change doesn't work as expected. After that, it takes more time to make sure we did not break anything else around that change. This is definitely not fun at all and it shouldn't be the way we work. We have control of the code and we should not tolerate a situation like that. We can fix this!

    The Single Responsibility Principle, or SRP, states that

    A class should have only one reason to change, only one responsibility.

    This means that when I want to add a new feature, I don't want to have to care about things I'm not changing. For example, in the class diagram below we have the ReportingService class, responsible for loading and processing data, then populating and showing a report. This class will be modified every time I need to change my database schema, the report engine behaviour or the layout of the report itself. It is violating the Single Responsibility Principle; it's doing too much.

    [Class diagram: ReportingService handling data access, report building and rendering all by itself]

    If we split the class into separate concerns, we will end up with classes that need to change for only one reason. It's still not perfect, but it's a start.

    [Class diagram: the responsibilities split into Repository, ReportBuilder, ReportEngine and a BetterReportingService that coordinates them]
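    In C#, a rough sketch of that split could look like this (the class names follow the diagram; the method signatures and stub bodies are my own assumptions):

    // Each collaborator now has a single reason to change.
    public class SalesData { }
    public class Report { }

    public class Repository
    {
        // Data access concern: changes only when the database schema changes.
        public SalesData LoadSalesData() { /* query the database */ return new SalesData(); }
    }

    public class ReportBuilder
    {
        // Layout concern: changes only when the report layout changes.
        public Report Build(SalesData data) { /* populate the report */ return new Report(); }
    }

    public class ReportEngine
    {
        // Engine concern: changes only when the reporting engine behaviour changes.
        public void Render(Report report) { /* show the report */ }
    }

    public class BetterReportingService
    {
        private readonly Repository repository;
        private readonly ReportBuilder builder;
        private readonly ReportEngine engine;

        public BetterReportingService(Repository repository, ReportBuilder builder, ReportEngine engine)
        {
            this.repository = repository;
            this.builder = builder;
            this.engine = engine;
        }

        // The only reason for this class to change is the orchestration itself.
        public void ShowSalesReport()
        {
            var data = this.repository.LoadSalesData();
            var report = this.builder.Build(data);
            this.engine.Render(report);
        }
    }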

    So, why is this so important again? 

    Well, first of all, reuse. In the diagram above we see that the DAL concern (Data Access Layer) is now encapsulated in a Repository class. This means we can reuse the Repository code to load the same data for other purposes, not only for creating reports.

    Second, by creating new classes we are forced to think of names for them. Good naming of classes and methods will help improve comprehension for people who are not familiar with this part of the code base. Another side effect is that we expose concepts that were hidden inside our original class, like the DAL (now the Repository class).

    Third, we are now free to evolve each concern more independently from the others than before. Also, it will be easier to write unit tests for such smaller and simpler classes. We reduced the overall complexity of the code, and we will be more confident that we are not breaking the application every time we make some changes.

    So, should we always break all the large classes into smaller ones until we have only a few methods in each of them?

    Not exactly; we need to be careful not to break the basic OOP notion of encapsulation. So where should we draw the line, where should we split our classes? Unfortunately, there is no easy answer; rather, we need to understand what the class responsibilities are.

    Responsibility Driven Design

    One technique we can use comes from the 80s, an OO design practice called RDD, for Responsibility Driven Design. The goal is to identify the Object Role Stereotypes of classes to better understand their responsibilities. Let's take a look at those six roles.

    Information Holder: Knows things and provides information. May make calculations from the data that it holds.

    Structurer: Knows the relationships between other objects.

    Controller: Controls and directs the actions of other objects.  Decides what other objects should do.

    Coordinator: Reacts to events and relays the events to other objects.

    Service Provider: Does a service for other objects upon request.

    Interfacer: Objects that provide a means to communicate with other parts of the system, external systems or infrastructure, or end users.

    The second part of this technique is to use CRC Cards, which are index cards put on a board while designing the system.

    Class name
    Responsibilities          | Collaborators
    - responsibility 1        | - collaborator 1
    - responsibility 2        | - collaborator 2

    CRC stands for Class - Responsibilities - Collaborators. You put all the responsibilities on the left side and all the collaborators (the classes you talk to) on the right side. When you put all the cards of a system or sub-system next to each other on a wall, you get a really great big picture of what's going on.

    Back to SRP, our ultimate goal is to take the secondary responsibilities on the left side and transform them into new collaborators on the right. In the end we should have only one responsibility (the principal one) on the left-hand side. So, in the previous example, BetterReportingService is now only a Controller for its collaborators. Repository is an Interfacer and ReportEngine is a Service Provider. ReportBuilder acts as both an Information Holder and a Structurer, so we may consider separating those concerns one more time.

    Dealing with a large code base

    It's not always easy to look at an aging code base and spot SRP violations. Sometimes we may think a class has only one responsibility when reading the code. The separation opportunities will rarely jump out at us.

    Fortunately, there is one technique that can help us with that: code metrics. Tools like NDepend and, to some extent, Visual Studio can use reflection to look at the code and calculate metrics that help us see the big picture with a different set of eyes. They can help us find things we could hardly see ourselves.

    LCOM (Lack of Cohesion of Methods) is a metric which indicates whether all the methods and fields of a class are strongly connected together. For a cohesive class, there should be a connection between all members of the class, including fields.

    The diagram below shows that the group composed of methods A and B and field x is connected together, but it has no connection to the rest of the methods and fields. The LCOM4 metric gives us a value of 2 for this class because 2 distinct groups exist.

    [Diagram: a class whose members form 2 disconnected groups (LCOM4 = 2)]
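    In code, a made-up class with that shape would look something like this (the member names are illustrative only):

    public class OrderService
    {
        // Group 1: MethodA and MethodB are connected through field x.
        private int x;
        public void MethodA() { this.x++; }
        public int MethodB() { return this.x * 2; }

        // Group 2: MethodC and MethodD are connected through field y,
        // but never touch anything in group 1. LCOM4 = 2.
        private string y;
        public void MethodC() { this.y = "pending"; }
        public string MethodD() { return this.y.ToUpper(); }
    }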

    Now, a class with a LCOM4 value of 1 will look more like this:

    [Diagram: a class whose members are all connected (LCOM4 = 1)]

    So, does this mean that the class in diagram 1 has 2 responsibilities and the one in diagram 2 only 1? Not necessarily. Some types of classes, like pure domain objects, may have only properties exposing the attributes of the class without being connected. We need other indicators to assess whether classes may possess multiple responsibilities.

    The Cyclomatic Complexity metric gives us the number of execution paths in a method. Code constructs like if, for, foreach, switch and while all generate at least one extra path of execution and contribute to code complexity. A class with a high cyclomatic complexity value means that it is actually doing something more than exposing attributes.
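    For example, this made-up method (Order and OrderLine are hypothetical domain types) has a cyclomatic complexity of 4: the base path, plus one for the if, one for the foreach and one for the if inside the loop.

    using System.Collections.Generic;

    public class OrderLine { public int Quantity; public decimal Price; }
    public class Order { public List<OrderLine> Lines = new List<OrderLine>(); }

    public class DiscountCalculator
    {
        public decimal ComputeDiscount(Order order)
        {
            if (order == null) return 0m;          // +1

            decimal discount = 0m;
            foreach (var line in order.Lines)      // +1
            {
                if (line.Quantity > 10)            // +1
                {
                    discount += line.Price * 0.05m;
                }
            }

            return discount;
        }
    }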

    So, we should look for classes with a high Lack of Cohesion of Methods value and a high Cyclomatic Complexity. Usually, if you know your code base a little bit, you should not be really surprised by the top results of that list. They are the ones you need to change for every new feature, the ones where bugs are found periodically and the ones that do not have unit tests around them.

    I hope you now understand the advantages of more little classes over a few big ones, and that, in the long run, it is worth our time to think about the OO design of the system to reduce complexity and responsibility coupling and to improve the testability of our code.

Sunday, February 20, 2011

SOLID principles at .Net Montreal Community

In a few weeks, on Saturday March 12th, I'll be giving a talk on the SOLID principles of Object Oriented Programming. While collecting information on the subject and preparing my talk, I'm figuring out that I'll be producing quite a lot of material. That's a good thing, because I'll be able to blog about the principles using the same material as my talk's slides. I know that a lot of people have already written about this subject, but it's going to be a good exercise for me and my writing skills!


Saturday, October 30, 2010

Fun with Mono.Cecil

I've always loved playing around with tools like Reflector and NDepend to take a look at the internals of my code base, but from an external perspective rather than inside Visual Studio.

Reflector was really useful for walking dependencies and hierarchies before I started using Resharper. NDepend is an amazing piece of software that gives me a lot of metrics and other information; unfortunately maybe a bit too much, as it's hard to make head or tail of it at first glance. I've never had the chance to play with the full version, and the trial is very well done in the sense that you can use most of the software, but it nags you just one step before you can nail something really useful.

Now in Visual Studio 2010 we have the Dependency Graph, which is really neat but difficult to use because of missing key features, like a way to control or filter the noise of being forced to graph the whole solution. You always need to start pruning the graph from scratch, and zooming in on a single namespace or dependency is hard. Nevertheless, with a bit of patience I can find a lot of information, like dependency cycles. But again, when I find one I'm stuck with only visual information; there is no way to just extract a list of namespaces, types or methods to help me tackle the culprits.

So, that is where I am now. I'd like to have something as powerful as Reflector to walk the links, and as complete as NDepend and the Dependency Graph, to get all the information I need to learn about and break bad dependencies, but also to prevent new ones from appearing in my code base. Wouldn't it be nice to be able to query the code base for metrics, dependencies and general patterns in my code? To have tests that help me keep my code clean?

Of course, I know that NDepend has its CQL feature. It's really amazing, but with the trial version of the software I can't go far. And to be honest, the editor is a bit clumsy and the language a bit limiting. No, what I need is a better experience, something really interactive to play with my queries until I get them right, something like the experience I get with LinqPad. Then I can put the result of my efforts into a unit test to be sure I won't repeat the same mistakes again.

The .Net reflection API is all good in theory, but to use it I would need to be really clever, because reflection works only by loading a dll in memory before querying it. And due to the way the CLR works, a dll can't be unloaded from the AppDomain, meaning that you need to close the app every time you want to reload a new version. Too bad. There are ways to work around that limitation, but I want to keep my solution simple. So an alternative to reflection is Mono.Cecil. I've heard a lot of good things about this library in the past and I know that a lot of good software uses it (like NDepend!), but I never took a look at it myself.

I think it's time for me to give it a try. I'll blog about my experience with it, trying to create a thin API on top of it to add metrics and a way to list all the bad dependencies in my code base.
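As a first taste, here is a minimal sketch of reading an assembly with Mono.Cecil and dumping, for each type, the types its methods call into. The assembly path is a placeholder and the program is only an assumption of where I'll start, not the final API.

using System;
using System.Linq;
using Mono.Cecil;

public static class DependencyDump
{
    public static void Main()
    {
        // Cecil reads the assembly from disk; nothing gets loaded into the AppDomain.
        var assembly = AssemblyDefinition.ReadAssembly(@"C:\path\to\MyApp.dll");

        foreach (var type in assembly.MainModule.Types)
        {
            // Collect the declaring types of every method referenced by this type's code.
            var calledTypes = type.Methods
                .Where(m => m.HasBody)
                .SelectMany(m => m.Body.Instructions)
                .Select(i => i.Operand as MethodReference)
                .Where(r => r != null)
                .Select(r => r.DeclaringType.FullName)
                .Distinct();

            Console.WriteLine(type.FullName);
            foreach (var called in calledTypes)
            {
                Console.WriteLine("  -> " + called);
            }
        }
    }
}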