DocumentDB Limits and Statistical Outliers

Monday, April 27, 2015

Azure’s DocumentDB has an appealing scalability model, but you must pay attention to the limits and quotas from the start. Of particular interest to me is the maximum request size for a document, which is currently 512 KB. When DocumentDB first appeared, the limit was a paltry 16 KB, so 512 KB feels roomy, but how much real application data can that hold?

Let’s say you need to store a collection of addresses for a hospital patient.

public class Patient
{
    public string Id { get; set; }
    public IList<Address> Addresses { get; set; }
}

public class Address
{
    public string Description { get; set; }
    public string City { get; set; }
    public string Country { get; set; }
}

In theory, the list of address objects is an unbounded collection that could exceed the maximum request size and generate runtime errors. But in practice, how many addresses could a single person be associated with? There is the home address, the business address, perhaps a temporary vacation address. You don’t want to complicate the design of the application to support unlimited addresses, so instead you might enforce a reasonable limit in the application logic and tell customers that having more than 5 addresses on file is not supported.
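As a rough illustration of enforcing that limit in application logic, here is a minimal sketch. The `PatientRules` class, the `AddAddress` method, and the `MaxAddresses` constant are hypothetical names invented for this example, and the cap of 5 is simply the number mentioned above.

```csharp
using System;
using System.Collections.Generic;

public class Address
{
    public string Description { get; set; }
    public string City { get; set; }
    public string Country { get; set; }
}

public class Patient
{
    public string Id { get; set; }
    public IList<Address> Addresses { get; set; }
}

// Hypothetical helper: keeps the cap in one place instead of
// scattering count checks around the application.
public static class PatientRules
{
    // Assumption: 5 is the business limit communicated to customers.
    public const int MaxAddresses = 5;

    public static void AddAddress(Patient patient, Address address)
    {
        if (patient == null) throw new ArgumentNullException(nameof(patient));
        if (address == null) throw new ArgumentNullException(nameof(address));

        if (patient.Addresses == null)
        {
            patient.Addresses = new List<Address>();
        }

        // Reject the write before the document can grow past the cap.
        if (patient.Addresses.Count >= MaxAddresses)
        {
            throw new InvalidOperationException(
                $"A patient can have at most {MaxAddresses} addresses on file.");
        }

        patient.Addresses.Add(address);
    }
}
```

Routing every addition through one method like this means the document size stays bounded by design, as long as nothing else writes to the list directly.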

A Harder Problem

Here’s a slightly trickier scenario.

public class Patient
{
    public string Id { get; set; }
    public IList<Medication> Medications { get; set; }
}

public class Medication
{
    public string Code { get; set; }
    public DateTime Ordered { get; set; }
    public DateTime Administered { get; set; }
}

Each medication entry consists of an 8-character code and two DateTime properties, which gives us a fixed size for every medication a patient receives. Again, the potential problem is the total number of medications a patient might receive.

The first question, then: how many Medication objects can a 512 KB request support?

The answer, estimated with a calculator and verified with code, is just over 6,000.
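One way to check an estimate like this in code is to serialize a patient and measure the bytes each extra entry adds. The sketch below is not necessarily how the number above was produced. It assumes the limit applies to the UTF-8 JSON body, ignores any other request overhead, and uses `System.Text.Json` as a stand-in serializer; the `RequestSizeEstimate` class and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public class Medication
{
    public string Code { get; set; }
    public DateTime Ordered { get; set; }
    public DateTime Administered { get; set; }
}

public class Patient
{
    public string Id { get; set; }
    public IList<Medication> Medications { get; set; }
}

public static class RequestSizeEstimate
{
    // Assumption: 512 KB means 512 * 1024 bytes of serialized JSON.
    public const int MaxRequestBytes = 512 * 1024;

    private static int SerializedSize(Patient patient) =>
        JsonSerializer.SerializeToUtf8Bytes(patient).Length;

    private static Medication Sample() => new Medication
    {
        Code = "ABCD1234", // 8-character code, as described in the post
        Ordered = new DateTime(2015, 4, 27, 8, 0, 0),
        Administered = new DateTime(2015, 4, 27, 9, 0, 0)
    };

    public static int MaxMedications()
    {
        var patient = new Patient
        {
            Id = Guid.NewGuid().ToString(),
            Medications = new List<Medication>()
        };

        patient.Medications.Add(Sample());
        int sizeWithOne = SerializedSize(patient);

        patient.Medications.Add(Sample());
        int sizeWithTwo = SerializedSize(patient);

        // Every entry after the first costs the same number of bytes,
        // including the comma that separates array items.
        int bytesPerExtraEntry = sizeWithTwo - sizeWithOne;

        return (MaxRequestBytes - sizeWithOne) / bytesPerExtraEntry + 1;
    }

    public static void Main()
    {
        Console.WriteLine($"Medications per request: {MaxMedications()}");
    }
}
```

The exact count depends on the serializer's date format, property naming, and the length of the Id, so treat the result as an estimate rather than a guarantee.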

The second question, then: is 6,000 a safe number?

To answer the second question, I found it useful to analyze some real data. The odds of a patient busting the request size turn out to be roughly 1 in 100,000, which is just over 4 standard deviations. Generally a 4-sigma number is good enough to say “it won’t happen”, but what’s interesting when operating at scale is that with 1 million patients you can expect to see the 4-sigma event not once, but 10 times.
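The arithmetic behind that observation can be sketched in a few lines. The odds and the patient count come from the paragraph above; the "at least one" figure additionally assumes patients are independent of one another, which is a simplification made for this example. The `OutlierOdds` class is a hypothetical name.

```csharp
using System;

public static class OutlierOdds
{
    public static void Main()
    {
        const double oddsPerPatient = 1.0 / 100_000; // roughly 1 in 100,000
        const int patients = 1_000_000;

        // Expected number of patients whose document exceeds the limit.
        double expectedOversized = patients * oddsPerPatient;

        // Chance that at least one patient exceeds the limit,
        // assuming independence between patients.
        double atLeastOne = 1 - Math.Pow(1 - oddsPerPatient, patients);

        Console.WriteLine($"Expected oversized patients: {expectedOversized:F1}");
        Console.WriteLine($"Chance of at least one: {atLeastOne:P3}");
    }
}
```

With these inputs the expected count is about 10, and the chance of seeing at least one oversized patient is well above 99.99%, so "it won't happen" becomes "it will almost certainly happen."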

From the business perspective, the result is unacceptable, so back to the drawing board.

We used to say that you spend 80% of your time on 20% of the problem. At scale there is the possibility of spending 80% of your time on 0.000007% of the problem.


Comments
Ryan CrawCour Monday, April 27, 2015
The problem of potentially unlimited "sub documents" or arrays is not something limited to DocumentDB and not something you alone struggle with. I wrote a post on data modeling considerations which you can find at - http://azure.microsoft.com/en-us/documentation/articles/documentdb-modeling-data/ In this article I talk about uncapped documents and why they are bad and some techniques you can use to avoid creating them. There is also this forum thread which deals with a similar problem and ways to deal with this. https://social.msdn.microsoft.com/Forums/azure/en-US/8fa348b2-a2f1-4885-b1d2-bf9bb3ce295c/data-modelling-a-document-with-nested-growing-collection-or-separate-growing-collection?forum=AzureDocumentDB It comes down to rethinking your data model and whether you really want to nest everything together in a single document. In your case, you could have a patient document, with the current medication nested and previous medications in a separate document "linked" to the patient. Or a document per medication "referring" back to the patient document. Hope this helps.
Scott Monday, April 27, 2015
@Ryan: Thanks for the links. There are many scenarios I'm looking at right now where you wouldn't think of the nested collection as an unbounded collection. In 99.7% of the use cases the data fits comfortably in 3 or 4 items.
Arthur Tuesday, April 28, 2015
Hmm, thanks, but I'm curious: how did you arrive at the conclusion about the 4 standard deviations?
Scott Wednesday, April 29, 2015
@Arthur: Just drawing a line in the sand based on existing data and expected workarounds...

OdeToCode by K. Scott Allen