Skip to content

Using the SDK for Windows Azure Tables

April 10, 2011

This post provides a brief tutorial of using the Azure SDK to read and write to Azure table storage, by offering examples. The MSDN site describes the REST API pretty well, but the library API is barren.

When programming in .NET, it makes sense to use the managed APIs rather than REST calls, because you’re working at a higher level. Since the MSDN documentation is lacking, I had to discover how to query tables by combining scraps I found on various blogs and forums, and also experimenting. I’ve gathered the information here in attempt to make it a bit more cohesive.

Details of how Azure tables and their data model works are described elsewhere; the best resource is probably the white paper. I won’t go into those kind of details here, and instead just focus on how to quickly get started using the API.

Let’s assume we currently don’t yet have a table, so we’ll create a schema from scratch. A primary key and row key must be selected, in additional to the other properties we wish to store (which will be columns in our table).

A contrived example may be to store grades achieved in a class. Let’s say there are various homework assignments and tests, so we’d like to store what each student in the class scored on each. Items to store include the names of the students, the names of the assignments or tests, the score received on it, and possibly other information such as a flag indicating if it was a test.

The obvious choice for keys are student or assignment name, so one may be a primary key and the other a row key. By choosing student as the primary key, a single student’s scores will be stored in the same partition. Alternately, assignment as the primary key results in scores for a single assignment living in a single partition (sorted by student name). Making this decision really depends on how the data is likely to be accessed. Let’s say the most common operation is to print a summary of a student’s progress, so we will more likely access the student’s grades at once. This informs us that student should be the primary key. It’s worth noting that partitioning is how Azure achieves scaling, so by partitioning on student, if we had a particularly worrisome group of students that constantly checked their grades, the load to the tables could be balanced by moving partitions to separate physical devices (which Azure does behind the scenes).

Setting Up

A C# class suffices to define the schema programmatically:

 public class StudentGrade: TableServiceEntity
 {
    public StudentGrade(string name, string assignmentId)
        : base(name, assignmentId)
    { }
 
    public StudentGrade()
    { }
 
    public int Grade { get; set; }
    public bool IsTest { get; set; }
 }

Every table entry has three required columns: the already mentioned row and partition keys, and a time stamp. There is no need to define these, because their associated properties are inherited by deriving from TableServiceEntity. I only need to add public properties for the additional data I wish to store, and Azure will use reflection on my class to create table columns associated with them (the types of these properties must be “approved” types described here, and keys are always strings). I may also add a constructor to populate fields, like the one above that sets the required keys, nevertheless the no-argument constructor is required by the data service, which will construct the class in response to table lookups, so I must include it.

Let’s start defining a class that wraps the table functionality:

public class GradeTable
{
    private readonly string _tableName;
    private readonly CloudTableClient _client;
 
    public GradeTable(string connectionString, string tableName)
    {
        _tableName = tableName;
        var account = CloudStorageAccount.Parse(connectionString);
        _client = account.CreateCloudTableClient();
        _client.CreateTableIfNotExist(_tableName);
    }
    ...

The CreateTableIfNotExist is invoked for safety, otherwise subsequent table operations will fail if the table was never created. The constructor takes two arguments, one is the Azure storage connection string and the other is the table name. I prefer to keep these configurable, because I may use different tables or accounts for testing than production. If the application runs in an Azure role, I’d likely get the values from the role’s configuration settings, like below.

var connectionString = RoleEnvironment.GetConfigurationSettingValue("GradeTableConnectionString");
var tableName= RoleEnvironment.GetConfigurationSettingValue("GradeTableName");
var gradeTable = new GradeTable(connectionString, tableName);

If it was a standalone Windows application, I would use a different technique for reading the configuration.

Adding Entries

Now that the the constructor for our table manager class is ready, here’s a function for the GradeTable class that adds a new entry:

public void AddEntry(StudentGrade entry)
{
    var ctx = _client.GetDataServiceContext();
    ctx.AddObject(_tableName, entry);
    try
    {
        ctx.SaveChangesWithRetries();
    }
    catch (DataServiceRequestException e)
    {
        Console.WriteLine("AddEntry exception: {0}", e.Message);
        var inner = e.InnerException as DataServiceClientException;
        if (inner != null)
        {
            Console.WriteLine("AddEntry inner exception: {0}", inner.Message);
            Console.WriteLine("Status code: {0}", inner.StatusCode);
        }
    }
}

I called SaveChangesWithRetries() instead of SaveChanges(), but either is available. The difference is with retries, the operation will try multiple times before giving up, based on a policy that may be assigned to the context object. In my debugger I looked at the default retry policy, and saw that will attempt 3 times with exponential back-off. You may even write a custom policy if you don’t like the one’s provided by .NET. I usually use the retry version to avoid having to deal with spurious connection errors, which do occur in the cloud.

The exception handling extracts the pertinent information when there is an error, for example as may occur if adding an item that already exists. The exception returned by itself is not very informative, as the message just says: “An error occurred while processing this request.” The inner exception message carries more information:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<error xmlns="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata">
    <code>EntityAlreadyExists</code>
    <message xml:lang="en-US">The specified entity already exists.</message>
</error>

While the text in the <message> tag explains the error, you probably don’t want to parse XML and its inner text to decode it. The DataServiceClientException includes a numerical StatusCode property, which is an HTTP status code, so this is what is easiest for the program to examine (in the case of already existing entity, the code is 409). If my example were production code, I’d map possible status codes to an enumerated error returned from AddEntry.

This above function adds the entry to the table synchronously. Alternately, this may be done asynchronously by adding the following to the grade table object:

// Define a delegate to notify callers when the add is complete
// It will return the status of the operation and opaque state
public delegate void AddEntryCallback(int status, object state);
 
// Class for internally handling the callback 
private class CallbackData
{
    public TableServiceContext Context { get; set; }
    public AddEntryCallback Callback { get; set; }
    public object State { get; set; }
}
 
public void AddEntryAsync(StudentGrade entry, AddEntryCallback callback, object state)
{
    var ctx = _client.GetDataServiceContext();
    ctx.AddObject(_tableName, entry);
    ctx.BeginSaveChangesWithRetries(AddCallback, new CallbackData { Callback = callback, Context = ctx, State = state } );
}
 
// Internal callback handler, will parse the response and return the resulting status to the caller 
void AddCallback(IAsyncResult result)
{
    CallbackData callbackData = null;
    int statusCode = (int) HttpStatusCode.InternalServerError;  // assume the worst
    try
    {
        callbackData = (CallbackData)result.AsyncState;
        callbackData.Context.EndSaveChangesWithRetries(result);
        statusCode = (int) HttpStatusCode.OK;  // no exception, response was ok
    }
    catch (DataServiceRequestException e)
    {
        var inner = e.InnerException as DataServiceClientException;
        if (inner != null)
        {
            statusCode = inner.StatusCode;
        }
    }
    // Notify the caller of what happened
    if (callbackData != null && callbackData.Callback != null)
    {
        callbackData.Callback(statusCode, callbackData.State);
    }
}

Reading Entries

To read the entries from the table, use LINQ queries. Because in this contrived example the most common operation is to ask for all of a student’s scores, here’s a function for GradeTable to run this query:

public IEnumerable<StudentGrade> LookupGradesForStudent(string student)
{
    var ctx = _client.GetDataServiceContext();
    var grades = from g in ctx.CreateQuery<StudentGrade>(_tableName)
                       where g.PartitionKey == student
                       select g;
 
     var query = grades.AsTableServiceQuery();
     return query.Execute();
}

The flexibility of LINQ allows for expressive queries. So we could query by RowKey to get all the scores for any assignment, or by any arbitrary field, for example asking for all scores above 90, or all scores posted before a certain date. Because the PrimaryKey is the student name, the above query is satisfied by accessing only a single partition. Depending on the size of the table, other kinds of queries may incur more overhead.

The result of running the above function is an enumerable list of StudentGrade objects for the requested student. So with this, it’s possible to do things like print the results and take the average:

var name = "Horselover Fat";
var grades = gradeTable.LookupGradesForStudent(name);
Console.WriteLine("Grades for {0}:", name);
foreach (var g in grades.OrderBy(g => g.Timestamp))
    Console.WriteLine("  Grade for {0} is {1}", g.RowKey, g.Grade);
 
double avg = grades.Average(g => (double) g.Grade );
Console.WriteLine("Average grade is {0}", avg);

We may also do this asynchronously by replacing Execute with BeginExecute. I don’t include an example here, but it’s similar to the asynchronous AddGrade function above, where the results are retrieved in a callback. In the callback, the EndExecute function on the context will return the IEnumerable list that Execute returned in the above example.

Updating Entries

Another basic operation to cover is updating an existing entry. As mentioned, attempting to add an item that already exists (i.e. an item with the same primary and row key) results in an error. Adding will not overwrite, so in order to change an entry, we must first fetch it, then update it, then save the changes. Let’s say we can adjust scores to past assignments when students do extra credit work, so our GradeTable class needs a function to update an existing entry:

public void AdjustGrade(string student, string assignment, int adjustment)
{
    var ctx = _client.GetDataServiceContext();
    var keys = from g in ctx.CreateQuery<StudentGrade>(_tableName)
                    where g.PartitionKey == student && g.RowKey == assignment
                    select g;
    var query = keys.AsTableServiceQuery();
    var grade = query.Execute().First();
 
    grade.Grade += adjustment;
 
    ctx.UpdateObject(grade);
    ctx.SaveChangesWithRetries();
}

I left out error and exception handling for brevity.

What happens if we have two threads, each updating the same entry simultaneously, one to adjust the score by +1, and the other to adjust it by +3? If the original score was 80, the expected score is 84. But the threads may have a race condition: both fetch the original grade, one adjusting it to 81, and the other to 83. They both attempt to save their result, so one update may overwrite the other. Luckily this case is avoided, because when the results are fetched from a query, the context tracks a version tag for each item (stored in an Etag). When it submits the update, the tag is compared to the version in the table, and if the item was modified in the intervening period, the update is rejected. This is called “optimistic concurrency,” and it prevents this kind of race condition because the update fails (with status code 412, “ConditionNotMet”), and in response the program can refetch the latest value before attempting to update it again.

Once the grade object is returned from the query, the context tracks the object. If you wish to replace it with a reference to a new object instead of updating the tracked version, first it must be detached. For the above example, this may be accomplished by replacing the UpdateObject with:

ctx.Detach(grade);
ctx.AttachTo(_tableName, newGrade, "*");
ctx.UpdateObject(newGrade);

The last argument to AttachTo is the Etag for the new item, and * means that the existing tag will be ignored on update. As a side-effect of this code, the optimist concurrency effect is negated in the case of race conditions–the last update will take effect even if unintended. If you wish to replace the object, and ensure that it was not updated in the meantime, you need to preserve the Etag. To do so, extract the Etag before detaching from the context like so:

var etag= ctx.Entities.First(e => e.Entity == grade).ETag;
ctx.Detach(grade);
ctx.AttachTo(_tableName, newGrade, etag);
ctx.UpdateObject(newGrade);

Removing Entries

One last operation to mention is removing entries. Deleting an entry is similar to updating, but instead of UpdateObject, call the function DeleteObject (followed by SaveChanges of course). I won’t provide a code sample, since it is so similar to the above example.

Conclusion

I hope this provides enough information for others to get started with the Azure SDK for table storage–I’ve used code similar to this on projects and it covers the common scenarios. I’ve only touched on the basics, but the DataServiceContext class (which is the base class for TableServiceContext) has a lot more fancy stuff to offer. A topic of a future post may be using some of the advanced features.

No comments yet

Leave a Reply

Your email address will not be published. Required fields are marked *