How to Delete Old Data from an Azure Storage Table: Part 2

In part 1 of this series, we explored how to delete data from an Azure Storage Table and covered the simple implementation of the AzureTablePurger tool.

The simple implementation seemed way too slow to me, and took a long time when purging a large amount of data from a table. I figured if I could execute many requests to Table Storage in parallel, it should perform a lot quicker.

To recap, our code fundamentally does 2 things:

  1. Enumerates entities to be deleted

  2. Delete entities using batch operations on Azure Table Storage

Since there could be a lot of data to enumerate and delete, there is no reason why we can't execute these operations parallel.

Since the AzureTablePurger is implemented as a console app, we could parallelize the operation by spinning up multiple threads, or multiple Tasks, or using the Task Parallel Library to name a few ways. The latter is the preferred way to build such a system in .NET 4 and beyond as it simplifies a lot of the underlying complexity of building multi-threaded and parallel applications.

I settled on using the Producer/Consumer pattern backed by the Task Parallel Library.

Producer/Consumer Pattern

Put simply, the Producer/Consumer pattern is where you have 1 thing producing work and putting it in a queue, and another thing taking work items from that queue and actually doing the work.

There is a pretty straightforward way to implement this using the BlockingCollection in .NET.

Producer

In our case, the Producer is the thing that is querying Azure Table Storage and queuing up work:

/// <summary>
/// Data collection step (ie producer).
///
/// Collects data to process, caches it locally on disk and adds to the _partitionKeyQueue collection for the consumer
/// </summary>
private void CollectDataToProcess(int purgeRecordsOlderThanDays)
{
    var query = PartitionKeyHandler.GetTableQuery(purgeRecordsOlderThanDays);
    var continuationToken = new TableContinuationToken();

    try
    {
        // Collect data
        do
        {
            var page = TableReference.ExecuteQuerySegmented(query, continuationToken);
            var firstResultTimestamp = PartitionKeyHandler.ConvertPartitionKeyToDateTime(page.Results.First().PartitionKey);

            WriteStartingToProcessPage(page, firstResultTimestamp);

            PartitionPageAndQueueForProcessing(page.Results);

            continuationToken = page.ContinuationToken;
            // TODO: temp for testing
            // continuationToken = null;
        } while (continuationToken != null);

    }
    finally
    {
        _partitionKeyQueue.CompleteAdding();
    }
}

We start by executing a query against Azure Table storage. Like with our simple implementation, this will execute a query and obtain a page of results, with a maximum of 1000 entities. We then take the page of data and break it up into chunks for processing.

Remember, to delete entities in Table Storage, we need the PartitionKey and RowKey. Since we could be dealing with a large amount of data, it doesn't make sense to store this PartitionKey and RowKey information in memory, otherwise we'll start running out of space. Instead, I decided to cache this on disk in one file for each PartitionKey. Inside the file, we write one line for each PartitionKey + RowKey combination that we want to delete.

We also add a work item to an in-memory queue that our consumer will pull from. Here’s the code:

/// <summary>
/// Partitions up a page of data, caches locally to a temp file and adds into 2 in-memory data structures:
///
/// _partitionKeyQueue which is the queue that the consumer pulls from
///
/// _partitionKeysAlreadyQueued keeps track of what we have already queued. We need this to handle situations where
/// data in a single partition spans multiple pages. After we process a given partition as part of a page of results,
/// we write the entity keys to a temp file and queue that PartitionKey for processing. It's possible that the next page
/// of data we get will have more data for the previous partition, so we open the file we just created and write more data
/// into it. At this point we don't want to re-queue that same PartitionKey for processing, since it's already queued up.
/// </summary>
private void PartitionPageAndQueueForProcessing(List<DynamicTableEntity> pageResults)
{
    _cancellationTokenSource.Token.ThrowIfCancellationRequested();

    var partitionsFromPage = GetPartitionsFromPage(pageResults);

    foreach (var partition in partitionsFromPage)
    {
        var partitionKey = partition.First().PartitionKey;

        using (var streamWriter = GetStreamWriterForPartitionTempFile(partitionKey))
        {
            foreach (var entity in partition)
            {
                var lineToWrite = $"";

                streamWriter.WriteLine(lineToWrite);
                Interlocked.Increment(ref _globalEntityCounter);
            }
        }

        if (!_partitionKeysAlreadyQueued.Contains(partitionKey))
        {
            _partitionKeysAlreadyQueued.Add(partitionKey);
            _partitionKeyQueue.Add(partitionKey);

            Interlocked.Increment(ref _globalPartitionCounter);
            WriteProgressItemQueued();
        }
    }
}

A couple of other interesting things here:

  • The queue we're using is actually a BlockingCollection, which is an in-memory data structure that implements the Producer/Consumer pattern. When we put things in here, the Consumer side, which we’ll explore below, takes them out

  • It's possible (and in fact likely) that we'll run into situations where data having the same PartitionKey ends up spanning multiple pages of results. For example, at the end of page 5, we have 30 records with PartitionKey = 0636213283200000000, and at the beginning of page 6, we have a further 12 records with the same PartitionKey. To handle this situation, when processing page 5, we create a file 0636213283200000000.txt to cache all of the PartitionKey + RowKey combinations and add 0636213283200000000 to our queue for processing. When we process page 6, we realize we already have a cache file 0636213283200000000.txt, so we simply append to the file it. Since we have already added 0636213283200000000 to our queue for processing. We don’t want to have duplicate items in the queue, otherwise we'd end up with multiple consumers each trying to consume or process the same cache file, which doesn't make sense. Since there isn't a built-in way to prevent us from adding duplicates into a BlockingCollection, the easiest option is to simply maintain a parallel data structure (in this case a HashSet) so we can keep track of and quickly query items we've added into the queue to ensure we're not going to add the same PartitionKey into the queue twice

  • We're using Interlocked.Increment() to increment some global variables. This ensures that when we're running multi-threaded, each thread can reliably increment the counter without running into any threading issues. If you're wondering why to use Interlocked.Increment() vs lock or volatile, there is a great discussion on the matter on this Stack Overflow post.

Consumer

The consumer in this implementation is responsible for actually deleting the entities we want to delete. It is given the PartitionKey that it should be dealing with, reads the cached PartitionKey + RowKey combination from disk, and constructs batches of no more than 100 operations at a time and executes them against the Table Storage API:

/// <summary>
/// Process a specific partition.
///
/// Reads all entity keys from temp file on disk
/// </summary>
private void ProcessPartition(string partitionKeyForPartition)
{
    try
    {
        var allDataFromFile = GetDataFromPartitionTempFile(partitionKeyForPartition);

        var batchOperation = new TableBatchOperation();

        for (var index = 0; index < allDataFromFile.Length; index++)
        {
            var line = allDataFromFile[index];
            var indexOfComma = line.IndexOf(',');
            var indexOfRowKeyStart = indexOfComma + 1;
            var partitionKeyForEntity = line.Substring(0, indexOfComma);
            var rowKeyForEntity = line.Substring(indexOfRowKeyStart, line.Length - indexOfRowKeyStart);

            var entity = new DynamicTableEntity(partitionKeyForEntity, rowKeyForEntity) { ETag = "*" };

            batchOperation.Delete(entity);

            if (index % 100 == 0)
            {
                TableReference.ExecuteBatch(batchOperation);
                batchOperation = new TableBatchOperation();
            }
        }

        if (batchOperation.Count > 0)
        {
            TableReference.ExecuteBatch(batchOperation);
        }

        DeletePartitionTempFile(partitionKeyForPartition);
        _partitionKeysAlreadyQueued.Remove(partitionKeyForPartition);

        WriteProgressItemProcessed();
    }
    catch (Exception)
    {
        ConsoleHelper.WriteWithColor($"Error processing partition ", ConsoleColor.Red);
        _cancellationTokenSource.Cancel();
        throw;
    }
}

After we finish processing a partition, we remove the temp file and we remove the PartitionKey from our data structure that is keeping track of the items in the queue, namely the HashSet _partitionKeysAlreadyQueued. There is no need to remove the PartitionKey from the queue, as that already happened when this method was handed the PartitionKey.

Note:

  • This implementation is reading all of the cached data into memory at once, which is fine for the data patters I was dealing with, but we could improve this by doing a streamed read of the data to avoid reading too much information into memory

 

 

Executing both in Parallel

To string both of these together and execute them in Parallel, here's what we do:

public override void PurgeEntities(out int numEntitiesProcessed, out int numPartitionsProcessed)
{
    void CollectData()
    void ProcessData()
    {
        Parallel.ForEach(_partitionKeyQueue.GetConsumingPartitioner(), new ParallelOptions { MaxDegreeOfParallelism = MaxParallelOperations }, ProcessPartition);
    }

    _cancellationTokenSource = new CancellationTokenSource();

    Parallel.Invoke(new ParallelOptions { CancellationToken = _cancellationTokenSource.Token }, CollectData, ProcessData);

    numPartitionsProcessed = _globalPartitionCounter;
    numEntitiesProcessed = _globalEntityCounter;
}

Here we're defining 2 Local Functions, CollectData() and ProcessData(). CollectData will serially execute our Producer step and put work items in a queue. ProcessData will, in parallel, consume this data from the queue.

We then invoke both of these steps in parallel using Parallel.Invoke() which will wait for them to complete. They'll be deemed completed when the Producer has declared there is no more things it is planning to add via:

_partitionKeyQueue.CompleteAdding();

and the Consumer has finished draining the queue.

Performance

Now we're doing this in parallel, it should be quicker, right?

In theory, yes, but initially, this was not the case:

From the 2 runs above you can visually see how the behavior differs from the Simple implementation: each "." represents a queued work item, and each "o" represents a consumed work item. We're producing the work items much quicker than we can consume them (which is expected), and once our production of work items is complete, then we are strictly consuming them (also expected).

What was not expected was the timing. It's taking approximately the same time on average to delete an entity using this parallel implementation as it did with the simple implementation, so what gives?

Turns out I had forgotten an important detail - adjusting the number of network connections available by using the ServicePointManager.DefaultConnectionLimit property. By default, a console app is only allowed to have a maximum of 2 concurrent connections to a given host. In this app, the producer is typing up a connection while it is hitting the API and obtaining PartitionKeys + RowKeys of entities we want to delete. There should be many consumers running at once sending deletion requests to the API, but there is only 1 connection available for them to share. Once all of the production is complete, then we'd free up that connection and now have 2 connections available for consumers. Either way, we're seriously limiting our throughput here and defeating the purpose of our much fancier and more complicated Parallel solution which should be quicker, but instead is running at around the same pace as the Simple version.

I made this change to rectify the problem:

private const int ConnectionLimit = 32;

public ParallelTablePurger()
{
    ServicePointManager.DefaultConnectionLimit = ConnectionLimit;
}

Now, we're allowing up to 32 connections to the host. With this I saw the average execution time drop to around 5-10ms, so somewhere between 2-4x the speed of the simple implementation.

I also tried 128 and 256 maximum connections, but that actually had a negative effect. My machine was a lot less responsive while the app was running, and average deletion times per entity were in the 10-12ms range. 32 seemed to be somewhere around the sweet spot.

For the future: turning it up to 11

There are many things that can be improved here, but for now, this is good enough for what I need.

However, if we really wanted to turn this up to 11 and make it a lot quicker, a lot more efficient and something that just ran constantly as part of our infrastructure to purge old data from Azure tables, we could re-build this to run in the cloud.

For example, we could create 2 separate functions: 1 to enumerate the work, and a second one that will scale out to execute the delete commands:

1. Create an Azure Function to enumerate the work

As we query the Table Storage API to enumerate the work to be done, put the resulting PartitionKey + RowKey combination either into a text file on blob storage (similar to this current implementation), or consider putting it into Cosmos DB or Redis. Redis would probably be the quickest, and Blob Storage probably the cheapest, and has the advantage that we avoid having to spin up additional resources.

Note, if we were to deploy this Azure Function to a consumption plan, functions have a timeout of 10 minutes. We'd need to make sure that we either limit our run to less than 10 minutes in duration, or we can handle being interrupted part way through and pick up where we left the next time the function runtime calls our functio

2. Create an Azure Function to do the deletions

After picking up an item from the queue, this function would be responsible for reading the cached PartitionKey + RowKey and batching up delete operations to send to Table Storage. If we deployed this as an Azure Function on a consumption plan, it will scale out to a maximum of 200 instances, which should be plenty for us to work through the queue of delete requests in a timely manner, and so long as we're not exceeding 20,000 requests per second to the storage account, we won't hit any throttling limits there.

Async I/O

Something else we'd look to do is to make the I/O operations asynchronous. This would probably provide some benefits in our console app implementation, although I'm not sure how much difference it would really make. I decided against async I/O for the initial implementation, since it doesn't play well with Parallel.ForEach (check out this Stack Overflow post or this GitHub issue for some more background), and would require a more complex implementation to work.

With a move to the cloud and Azure Functions, I'm not sure how much benefit async I/O would have either, since we rely heavily on scaling out the work to different processes and machines via Azure Functions. Async is great for increasing throughput and scalability, but not raw performance, which is what our goal is here.

Side Note: Durable Functions

Something else which is pretty cool and could be useful here is Azure Durable functions. This could be an interesting solution to this problem as well, especially if we wanted to report back on aggregate statistics like how long it took to complete all of the work and average time to delete a specific record (ie the fan-in).

Conclusion

It turns out the Parallel solution is quicker than the Simple solution. Feel free to use this tool for your own purposes, or if you’re interested in improving it, go ahead and fork it and make a pull request. There is a lot more that can be done :)

How to Delete Old Data From an Azure Storage Table : Part 1

Have you ever tried to purge old data from an Azure table? For example, let's say you're using the table for logging, like the Azure Diagnostics Trace Listener which logs to the WADLogsTable, and you want to purge old log entries.

 There is a suggestion on Azure feedback for this feature, questions on Stack Overflow (here's one and another), questions on MSDN Forums and bugs on Github.

 There are numerous good articles that cover deletion of entities from Azure Tables.

 However, no tools are available to help.


UPDATE: I’m creating a SaaS tool that will automatically delete data from your Azure tables older than X days so you don’t need to worry about building this yourself. Interested? Sign up for the beta (limited spots available):


TL;DR: while the building blocks exist in the Azure Table storage APIs to delete entities, there are no native ways to easily bulk delete data, and no easy way to purge data older than a certain number of days. Nuking the entire table is certainly the easiest way to go, so designing your system to roll to a new table every X days or month or whatever (for example Logs201907, Logs201908 for logs generated in July 2019 and August 2019 respectively) would be my recommendation.

If that ship has sailed and you stuck in a situation where you want to purge old data from a table, like I was, I created a tool to make my life, and hopefully your life as well, a bit easier.

 The code is available here: https://github.com/brentonw/AzureTablePurger

Disclaimer: this is prototype code, I've run this successfully myself and put it through a bunch of functional tests, but at present, it doesn’t have a set of unit tests. Use at your own risk :)

How it Works

Fundamentally, this tool has 2 steps:

  1. Enumerates entities to be deleted

  2. Delete entities using batch operations on Azure Table Storage

I started off building a simple version which synchronously grabs a page of query results from the API, then breaks that page into batches of no more than 100 items, grouped by PartitionKey - a requirement for the Azure Table Storage Batch Operation API, then execute batch delete operations on Azure Table Storage.

Enumerating Which Data to Delete

Depending on your PartitionKey and RowKey structure, this might be relatively easy and efficient, or it might be painful. With Azure Table Storage, you can only query efficiently when using PartitionKey or a PartitionKey + RowKey combination. Anything else results in a full table scan which is inefficient and time consuming. There is tons of background on this in the Azure docs: Azure Storage Table Design Guide: Designing Scalable and Performant Tables.

Querying on the Timestamp column is not efficient, and will require a full table scan.

If we take a look at the WADLogsTable, we'll see data similar to this:

 

PartitionKey = 0636213283200000000
RowKey = e046cc84a5d04f3b96532ebfef4ef918___Shindigg.Azure.BackgroundWorker___Shindigg.Azure.BackgroundWorker_IN_0___0000000001652031489

 

Here's how the format breaks down:

PartitionKey = "0" + the minute of the number of ticks since 12:00:00 midnight, January 1, 0001

RowKey = this is the deployment ID + the name of the role + the name of the role instance + a unique identifier

 

Every log entry within a particular minute is bucketed into the same partition which has 2 key advantages:

  1. We can now effectively query on time

  2. Each partition shouldn't have too many records in there, since there shouldn't be that many log entries within a single minute

Here's the query we construct to enumerate the data:

public TableQuery GetTableQuery(int purgeEntitiesOlderThanDays)
{
    var maximumPartitionKeyToDelete = GetMaximumPartitionKeyToDelete(purgeEntitiesOlderThanDays);

    var query = new TableQuery()
        .Where(TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.LessThanOrEqual, maximumPartitionKeyToDelete))
        .Select(new[] { "PartitionKey", "RowKey" });

    return query;
}

private string GetMaximumPartitionKeyToDelete(int purgeRecordsOlderThanDays)
{
    return DateTime.UtcNow.AddDays(-1 * purgeRecordsOlderThanDays).Ticks.ToString("D19");
}

The execution of the query is pretty straightforward:

var page = TableReference.ExecuteQuerySegmented(query, continuationToken);

The Azure Table Storage API limits us to 1000 records at a time. If there are more than 1000 results, the continuationToken will be set to a non-null value, which will indicate we need to make the query again with that particular continuationToken to get the next page of data from the query.

Deleting the Data

In order to make this as efficient as possible and minimize the number of delete calls we make to the Auzre Table Storage API, we want to batch things up as much as possible.

Azure Table Storage supports batch operations, but there are two caveats we need to be aware of (1) all entities must be in the same partition and (2) there can be no more than 100 items in a batch.

To achieve this, we're going to break our page of data into chunks of no more than 100 entities, grouped by PartitionKey:

protected IList<IList<DynamicTableEntity>> GetPartitionsFromPage(IList<DynamicTableEntity> page)
{
    var result = new List<IList<DynamicTableEntity>>();

    var groupByResult = page.GroupBy(x => x.PartitionKey);

    foreach (var partition in groupByResult.ToList())
    {
        var partitionAsList = partition.ToList();
        if (partitionAsList.Count > MaxBatchSize)
        {
            var chunkedPartitions = Chunk(partition, MaxBatchSize);
            foreach (var smallerPartition in chunkedPartitions)
            {
                result.Add(smallerPartition.ToList());
            }
        }
        else
        {
            result.Add(partitionAsList);
        }
    }

    return result;
}

Then, we'll iterate over each partition, construct a batch operation and execute it against the table:

foreach (var entity in partition)
{
    Trace.WriteLine(
        $"Adding entity to batch: PartitionKey=, RowKey=");
    entityCounter++;

    batchOperation.Delete(entity);
    batchCounter++;
}

Trace.WriteLine($"Added  items into batch");

partitionCounter++;

Trace.WriteLine($"Executing batch delete of  entities");
TableReference.ExecuteBatch(batchOperation);

Here's the entire processing logic:

public override void PurgeEntities(out int numEntitiesProcessed, out int numPartitionsProcessed)
{
    VerifyIsInitialized();

    var query = PartitionKeyHandler.GetTableQuery(PurgeEntitiesOlderThanDays);

    var continuationToken = new TableContinuationToken();

    int partitionCounter = 0;
    int entityCounter = 0;

    // Collect and process data
    do
    {
        var page = TableReference.ExecuteQuerySegmented(query, continuationToken);
        var firstResultTimestamp = PartitionKeyHandler.ConvertPartitionKeyToDateTime(page.Results.First().PartitionKey);

        WriteStartingToProcessPage(page, firstResultTimestamp);

        var partitions = GetPartitionsFromPage(page.ToList());

        ConsoleHelper.WriteLineWithColor($"Broke into  partitions", ConsoleColor.Gray);

        foreach (var partition in partitions)
        {
            Trace.WriteLine($"Processing partition ");

            var batchOperation = new TableBatchOperation();
            int batchCounter = 0;

            foreach (var entity in partition)
            {
                Trace.WriteLine(
                    $"Adding entity to batch: PartitionKey=, RowKey=");
                entityCounter++;

                batchOperation.Delete(entity);
                batchCounter++;
            }

            Trace.WriteLine($"Added  items into batch");

            partitionCounter++;

            Trace.WriteLine($"Executing batch delete of  entities");
            TableReference.ExecuteBatch(batchOperation);

            WriteProgressItemProcessed();
        }

        continuationToken = page.ContinuationToken;

    } while (continuationToken != null);

    numPartitionsProcessed = partitionCounter;
    numEntitiesProcessed = entityCounter;
}

Performance

I ran this a few times on a subset of data. On average it takes around 10-20ms depending on the execution run to delete each entity:

This seemed kind of slow to me, and when trying to purge a lot of data, it takes many hours to run. I figured I should be able to improve the speed dramatically by turning this into a parallel operation.

Stay tuned for part 2 of this series to dive deeper into the parallel implementation.

Conclusion for Part 1

Since Azure Table Storage has no native way to purge old data, your best best is to structure your data so that you can simply delete old tables when you no longer need them and purge data that way.

However, if you can’t do that, feel free to make use of this AzureTablePurger tool.

FastBar's Technical Architecture

Previously, I've discussed the start of FastBar, how the client and server technology stacks evolved and what it looks like today (read more in part 1, part 2 and part 3 of that series).

As a recap, here's what the high-level components of FastBar look like:

FastBar Components - High Level.png

Let's dive deeper.

Architecture

So, what does FastBar’s architecture look like under the hood? Glad you asked:

Client apps:

  • Registration: used to register attendees at an event, essentially connecting their credit card to a wristband and all of the associated features that are required at a live event

  • Point of Sale: used to sell stuff at events

These are both mobile apps built in Xamarin and running on iOS devices.

The server is more complicated from an architectural standpoint, and is divided into the following primary components:

  • getfastbar.com - the primary customer facing website. Built on Squarespace, this provide primarily marketing content for anyone wanting to learn about FastBar

  • app.getfastbar.com - our main web app which provides 4 key functions:

    • User section - as user, there are a few functions you can perform on FastBar, such as creating an account, adding a credit card, updating your user profile information and if you've got the permissions, creating events. This section is pretty basic

    • Attendee section - as an attendee, you can do things like pre-register for an event, view your tab, change your credit card, email yourself a receipt and adjust your tip. This is the section of the site that receives the most traffic

    • Event Control Center section - this is by far the largest section of the web app, it's where events can be fully managed: configuring details, connecting payment accounts, configuring taxes, setting up pre-registration, managing products and menus, viewing reports and downloading data and a whole lot more. This is where event organizers and FastBar staff spend the majority of their time

    • Admin section - various admin related features used by FastBar support staff. The bulk of management related to a specific event, they would do from the Event Control Center if acting on behalf of an event organizer

  • api.getfastbar.com - our API, primarily used by our own internal apps. We also open up various endpoints to some partners. We don’t make this broadly accessible publicly yet because it doesn't need to be. However, it’s something we may decide to open up more broadly in the future

The main web app and API share the same underlying core business logic, and are backed by a variety of other components, including:

  • WebJobs:

    • Bulk Message Processor - whenever we're sending a bulk message, like an email or SMS that is intended to go to many attendees, the Bulk Message Processor will be responsible for enumerating and queuing up the work. For example, if we were to send out a bulk SMS to 10,000 attendees of the event, whatever initiates this process (the web app or the API) will queue up a work item for the Bulk Message Processor that essentially says "I want to send a message to a whole bunch of people". The Bulk Message Processor will pick up the message and start enumerating 10,000 individual work items that it will queue up for processing by the Outbound SMS Processor, a downstream component. The Outbound SMS Processor will in turn pick up each work item and send out individual SMSs

    • Order Processor - whenever we ingest orders from the POS client via the API, we do the minimal amount of work possible so that we can respond quickly to the client. Essentially, we're doing some initial validation and persisting the order in the database, then queuing a work item so that the Order Processor can take care of the heavy lifting later, and requests from the client are not unnecessarily delayed. This component is very active during an event

    • Outbound Email Processor - responsible for sending an individual email, for example as the result of another component that queued up some work for it. We use Mailgun to send emails

    • Outbound Notification Processor - responsible from sending outbound push notifications. Under the covers this uses Azure Notification Hub

    • Outbound SMS Processor - responsible for sending individual SMS messages, for example a text message update to an attendee after they place an order. We send SMSs via Twilio

    • Sample Data Processor - when we need to create a sample event for demo or testing purposes, we send this work to the Sample Data Processor. This is essentially a job that a admin user may initiate from the web app, and since it could take a while, the web app will queue up a work item, then the Sample Data Processor picks it up and goes to work creating a whole bunch of test data in the background

    • Tab Authorization Processor - whenever we need to authorize someone's credit card that is connected to their tab, the Tab Authorization Processor takes care of it. For example, if attendees are pre-registering themselves for an event week before hand, we vault their credit card details securely, and only authorize their card via the Tab Authorization Processor 24 hours before the event starts

    • Tab Payment Processor - when it comes time execute payments against a tab, the Tab Payment Processor is responsible for doing the work

    • Tab Payment Sweeper - before we can process a tab's payment, that work needs to be queued. For example, after an event, all tabs get marked for processing. The Tab Payment Sweeper runs periodically, looking for any tabs that are marked for processing, and queues up work for the Tab Payment Processor. It's similar in concept to the Bulk Message Processor in that it's responsible for queuing up work items for another component

    • Tab Authorization Sweeper - just like the Tab Payment Sweeper, the Tab Authorization Sweeper looks for tabs that need to be authorized and queues up work for the Tab Authorization Processor

  • Functions

    • Client Logs Dispatcher - our client devices are responsible for pushing up their own zipped-up, JSON formatted log files to Azure Blob Storage. The Client Logs Dispatcher then takes the logs and dispatches them to our logging system, which is Log Analytics, part of Azure Monitor

    • Server Logs Dispatcher - similar in concept to the Client Logs Dispatcher, the Server Logs Dispatcher is responsible for taking server-side logs, which initially get placed into Azure Table Storage, and pushing them to Log Analytics so we have both client and server logs in the same place. This allows us to do end to end queries and analysis

    • Data Exporter - whenever a user requests an esport of data, we handle this via the Data Exporter. For large events, an export could take some time. We don’t want to tie up request threads or hammer the database, so we built the Data Exporter to take care of this in the background

    • Tab Recalculator - we maintain a tab for each attendee at an event, it's essentially the summary of all of their purchases they've made at the event. From time to time, changes happen that require us to recalculate some or all tabs for an event. For example, let's say the event organizer realized that beer was supposed to be $6, but was accidentally set for $7 and wanted to fix this going forward, and for all previous orders. This means we need to recalculate all tabs that have orders involving the affected products. For a large event there could be many thousands of tabs affected by this change, and since each tab has unique characteristics, including the rules around how totals should be calculated, this has to be done individually for each tab. The Tab Recalculator takes care of this work in the background

    • Tags Deduplicator - FastBar is a complicated distributed system that support offline operation of client devices like the Registration and POS apps. On the server side, we also process things in parallel in the background. Long story short, these two characteristics mean that sometimes data can get out of sync. The Tags Deduplicator helps put some things back in sync so eventually arrive at a consistent state

Azure Functions vs WebJobs

So, how come some things are implemented as Functions and some as WebJobs? Quite simply, the WebJobs were built before Azure Functions existed and/or before Azure Functions really became a thing.

Nowadays, it seems as though Azure Functions are the preferred technology to use so we made the decision a while ago to create any new background components using Functions, and, if any significant refactoring was required to a WebJob, we'll take the opportunity to move it over to a Function as well.

Over time, we plan on phasing out WebJobs in favor of Functions.

Communication Between the Web App / API and Background Components

This is almost exclusively done via Azure Storage Queues. The only exception is the Client Logs Dispatcher, which can also be triggered by a file showing up in Blob Storage.

Azure has a number of queuing solutions that could be used here. Storage queues is a simple solution that does what we need, so we use it.

Communication with 3rd Party Services

Wherever we can, we’ll push interaction with 3rd party services to background components. This way, if 3rd party services are running slowly or down completely for a period of time, we minimize impact on our system.

Blob and Table Storage

We utilize Blog storage in a couple of different ways:

  • Client apps upload their logs directly to blog storage for processing bys the Client Logs Dispatcher

  • Client apps have a feature that allows the user to create a bug report and attach local state. The bug is logged directly into our work item tracking system, Pivotal Tracker. We also package up the client side state and upload it to blob storage. This allows developers to re-create the state on a client device on their own device, or the simulator for debugging purposes

Table storage is used for the initial step in our server-side logging. We log to Table storage, and then push that data up to Log Analytics via the Server Logs Dispatcher.

Azure SQL

Even though there are a lot of different technologies to store data these days, we use Azure SQL for a few key reasons: it’s familiar, it works, it’s a good choice for a financial system like FastBar where data is relational and we require ACID semantics.

Conclusion

That’s a brief overview of FastBar’s technical architecture. In future posts, I’ll go over more of the why behind the architectural choices and the key benefits that it has.

Choosing a Tech Stack for Your Startup - Part 3: Cloud Stacks, Evolving the System and Lessons Learnt

This is the final part in a 3 part series around choosing a tech stack for your startup:

  • In part 1, we explored the choices we made and the evolution of FastBar’s client apps

  • In part 2, we started the exploration of the server side, including our technology choices and philosophy when it came to building out our MVP

  • In part 3, this post, we’ll wrap-up our discussion of the server side choices and summarize our key lessons learnt.

As a recap, here’s the areas of the system we’re focused on we’re focused on:

FastBar Components - Server Hilighted.png

And in part 2, we left off discussing self-hosting vs utilizing the cloud. TL;DR - forget about self hosting and leverage the cloud.

Next step - let’s pick a cloud provider…

AWS vs Azure

In 2014, AWS was the undisputed leader in the cloud, but Azure was quickly catching up in capabilities and feature set.

Through the accelerator we got into, 9 Mile Labs, we had access to some free credits with both AWS and Azure.

I decided to go with Azure, in part because they offered more free credits via their Bizpark Plus program than what AWS was offering us, in part because I was more familiar with their technology than that of AWS, in part because I'm generally a fan of Microsoft technology, and in part because I wanted to take advantage of their Platform as a Service (PaaS) offerings. Specifically, Azure App Service Web Apps and Azure SQL - AWS didn't have any direct equivalents for those at the time. I could certainly spin up VMs and install my own versions of IIS and SQL on AWS, but that was more work for me, and I had enough things to do.

PaaS vs IaaS

After doing some investigation into Azure's PaaS offerings, namely App Service Web Apps and Azure SQL, I decided to give them a go.

With PaaS offerings, you're trading some flexibility for convenience.

For example, with Web Apps, you don’t deploy your app to a machine or a specific VM - you deploy it to Azure's Web app service, and it deploys it to one or more VMs on your behalf. You don’t remote desktop into the VM to poke around - you use the web-based tools or APIs that Microsoft provides for you. Azure SQL doesn't support all of the features that regular SQL does, but it supports most of them. You don’t have the ability to configure where your database and log files will be placed, Azure manages that for you. In most cases, this is a good thing, as you've got better things to do.

With Web Apps, you can easily setup auto-scaling, like I described in part 2, and Azure will magically create or destroy more VMs according to the rules you setup, and route traffic between them. With SQL Azure, you can do cool things like create read-only replicas and geo-redundant failover databases within minutes:

If there is a PaaS offering of a particular piece of infrastructure that you require on whatever cloud you're using, try it out. You'll be giving up some flexibility, but you'll get a whole lot in return. For most scenarios, it will be totally worth it.

3rd Party Technologies

Stripe

The first 3rd party technology service we integrated was Stripe - FastBar is a payment system after all, so we needed a way to vault credit cards and do payment processing. At the time Stripe was the gold standard in terms of developer friendly payment APIs, so we went with it and still use it to this day. We've had our fair share of issues with Stripe, but overall it's worked well for us.

Loggly: A Cautionary Tale

Another piece of 3rd party tech we used early on was Loggly. This is essentially “logging as a service” and instead of you having to figure out how to ingest, process and search large volumes of log data, Loggly provides a cloud-based service for you.

We used this for a couple of years and eventually moved off it because we found the performance was not reliable.

We ran into an indecent one time where Loggly, which typically would respond to our requests in 200-300ms, was taking 90-120 seconds to respond (ouch!). Some of our server-side web and API code called Loggly directly as part of the request execution path (a big no-no, that was our bad) and needless to say, when your request thread is tied up waiting for a network call that is going to take 90-120 seconds, everything went to hell.

During the incident, it was tough for us to figure out what was going on, since our logging was impacted. After the incident, we analyzed and eventually tracked down 90-120 second response times from Loggly as the cause. We made changes to our system so that we would never again call Loggly directly as part of a request's execution path, rather we'd log everything "locally" within the Azure environment and have a background process that would push it up to Loggly. This is really what we should have been doing from the beginning. At the same time, Loggly should have been more robust.

This made us immune to any future slowdowns on the Loggly side, but over time we still found that our background process was often having trouble sending data to Loggly. We had an aot-retry mechanism setup so we’d keep retrying to send to Loggly until we succeeded. Eventually this would work, but we found this retry mechanism was bring triggered way too often for our liking. We also found similar issues on our client apps, where we'd have our client apps send logs directly to Loggly in the background to avoid us having to send to our server, then to Loggly. This was more of an issue, since clients operate in constrained bandwidth environments.

Overall, we experienced lots of flakiness with Loggly regardless of if we were communicating with it from the client or server.

In addition, the cheaper tiers of Loggly are quite limited in the amount of data you can send to them. For a large event, we'd quickly hit the data cap, and the rest of our logs would be dropped. This made the Loggly search features (which were awesome by the way, and one of the key things that attracted us to Loggly) pretty much useless for us, since we'd only have a fraction of our data available unless we moved up to a significantly more expensive tier.

We removed Loggly from the equation in favor of Azure's Log Analytics (now renamed to Azure Monitor). It's inside Azure with the rest of our stuff, has awesome query capabilities (on par with Loggly) and it’s much cheaper for us due to its “cloud-based pricing model” that scales based on the amount you use it, as opposed to handful of main pricing buckets with Loggly.

Twilio

We use Twilio for sending SMS messages. Twilio has worked great for us from the early days, and we don’t have any plans to change it anytime soon.

Cloudinary

On a previous project, I got deep into the complexities of image processing: uploading, cropping, resizing and hosting, distributing to a CDN etc…

TL;DR it's something that seems really simple on the surface, but quickly spirals out of control - it’s a hard problem to solve properly. 

I learnt my lesson on a previous project, and on FastBar, I did not pass Go and did not collect $200, rather I went straight to Cloudinary. It's a great product, easy to use, and it removes all of our image processing and hosting hassles.

Mailgun

Turns out sending email is hard. That’s why companies like Mailgun and Sendgrid exist.

We decided to go with Mailgun since it had a better pricing model for our purposes compared to Sendgrid. But fundamentally, they’re both pretty similar. They help you take care of the complexities of sending reliable email so you don’t have to deal with it.

Building out the Event Control Center

As our client apps and their underlying APIs started to mature, we started turning our development focus to building out the Event Control Center on the server - the place where event organizers and FastBar staff could fully configure all aspects of the event, manage settings, configure products and menus, view reports etc…

This was essentially a traditional web app. We looked at using tech like React or Angular. As we speced out our screens, we realized that our requirements were pretty straightforward. We didn't have lots of pages that needed a rich UI, we didn’t have a need for a Single Page App (SPA), and overall, our pages were pretty simple. We decided to go with a more "traditional" request/response ASP.NET web app, using HTML 5, JS, CSS, Bootstrap, Jquery etc…

The first features we deployed were around basic reporting, followed by the ability to create events, edit settings for the event, view and manage attendee tabs, manage refunds, create and configure products and menu items.

Nowadays, we've added multi user support, tax support, comprehensive reporting and export, direct and bulk SMS capabilities, configuration of promotions (ie discounts), device management, attendee surveys and much more.

The days of managing via SQL Management Studio are well in the past (thankfully!).

Re-building the public facing website

For a long time, the public facing section of the website was a simple 1-pager explanation of FastBar. It was long overdue for a refresh, so in 2018 I set out to rebuild it, improve the visuals, and most importantly, update the content to better reflect what we had to offer customers.

For this, we considered a variety of options, including: custom building an ASP.NET site, Wordpress, Wix, Squarespace etc...

Building a custom ASP.NET website was kind of a hassle. Our pages were simple, but it would be highly beneficial if non-developers could easily edit content, so we really needed a basic CRM. This meant we needed to go for a self-hosted CRM, like Wordpress, or a hosted CRM, like Wordpress.com, Wix or Squarespace.

I had build and deployed enough basic Wordpress sites to know that I didn't want to spend our time, effort and money on self-hosting it. Self-hosting means having to deal with constant updates to the platform and the plugins (Wordpress is a ripe target for hackers, so keeping everything up to date is critical), managing backups and the like.

We were busy enough building features for FastBar system, I didn’t want to allocate precious dev resources to the public facing website when a hosted solution at $12/mo (or thereabouts) would be sufficient.

For Wordpress in general, I found it tough to find a good quality template that matches the visuals I was looking for. To be clear, there are a ton of templates available, I'd say way too many. I found it really hard to hunt through the sea of mediocrity to find something I really liked.

When evaluating the hosted offerings like Squarespace and Wix, my first concern was that as a technology company I was worried potential engineering hires might judge us for using something like that. I don’t know about you, but I'll often F12 or Ctrl-U a website to see what's going on under the hood :) Also, while quick to spin up, hosted offerings like Squarespace lacked what I consider basic features, like version control, so that was a big red flag.

Eventually I determined that the pros and simplicity of a hosted offering outweighed the cons and we went with Squarespace. Within about a week, we had the site re-designed and live - the vast majority of that time was spent on the marketing and messaging, the implementation part was really easy.

Where we're at today

Today, our backend is comprised of 3 main components: the Core Web App and API, our public facing website and 3rd party services that we depend on.

Our core Web App and API is built in ASP.NET and WebAPI and runs on Azure. We leverage Azure App Services, Azure SQL, Azure Storage service (Blob, Table and Queue), Azure Monitor (Application Insights and Log Analytics), Azure Functions, WebJobs, Redis and a few other bits and pieces.

The public facing website runs on Squarespace. 

The 3rd party services we utilize are Stripe, Cloudinary, Twilio and Mailgun.

Lessons Learnt

Looking back at our previous lessons learnt from client side development:

  1. Optimize for Productivity

  2. Choose Something Popular

  3. Choose the Simplest Thing That Works

  4. Favor Cross Platform Tech

The first 3 are highly applicable to the server side development. The 4th is more client specific. You could make the argument that it’s valuable server-side as well, it depends on how many server environments you’re planning on deploying to. In most cases, you’re going to pick a stack and stick with it, so it’s less relevant.

Here are some additional lessons we learnt on the server side.

Ruthlessly Prioritize

This one applies to both client and server side development. As a startup, it's important to ruthlessly prioritize your development work and tackle the most important items first. How far can you go without building out an admin UI and instead relying on SQL scripts? What's the most important thing that your customers need right now? What is the #1 feature that will help you move the product and business forward?

Prioritization is hard, especially because it's not just about the needs of the customer. You also have to balance the underlying health of the code base, the design and the architecture of the system. You need to be aware of any technical debt you're creating, and you need to be careful not to paint yourself into a corner that you might not be able to get out of later. You need to think about the future, but not get hung up on it too much that you adopt unnecessary work now that never ends up being needed later. Prioritization requires tough tradeoffs. 

Prioritization is more art than science and I think it's something that continues to evolve with experience. Prioritize the most important things you need right now, and do your best to balance that with future needs.

Just go Cloud

Whether you're choosing AWS, Azure, Google or something else, just go for the cloud. It's pretty much a given these days, so hopefully you don't have the urge to go to the dark side and buy and host your own servers. 

Save yourself the hassle, the time and the money and utilize the cloud. Take advantage of the thousands upon thousands of developers working at Amazon, Microsoft and Google who are working to make your life easier and use the cloud.

Speaking of using the cloud…

Favor PaaS over IaaS

If there is a PaaS solution available that meets you needs, favor it over IaaS. Sure, you'll lose some control, but you'll gain so much in terms of ease of use and advanced capabilities that would be complicated and time consuming for your to build yourself.

It means less work for you, and more time available to dedicate to more important things, so favor PaaS over IasS.

Favor Pre-Built Solutions

Better still, if there is an entire solution available to you that someone else hosts and manages, favor it.

Again, less work for you, and allows you to focus your time, energy and resources on more important problems that will provide value to your customers, so favor pre-built solutions.


Conclusion

In part 1 we discussed client side technology choices we went through when building FastBar, including our thinking around Android vs iOS, which client technology stack to use, how our various apps evolved, where we’re at today, and key lessons learnt.

In part 2 and part 3, this post, we discussed the server side, including choosing a server side stack, building out an MVP, deciding to self-host or utilize the cloud, AWS vs Azure, various other 3rd party technologies we adopted, where we’re at today and more lessons learnt, primarily related to server side development.

Hopefully you can leverage some of these lessons in building your own startup. Good luck, and go change the world!