OdeToCode IC Logo

Counting Array Entries in a Cosmos DB Document

Thursday, August 30, 2018

I’ve been trying to figure out the most efficient approach to counting matches in a Cosmos DB array, without redesigning collections and documents.

To explain the scenario, imagine a document based on the following class.

class Patient
{
    [JsonProperty(PropertyName = "id")]
    public string Id { get; set; }

    public int[] Attributes { get; set; }   
} 

I need a query to return patient documents with a count of how many values in the Attributes property match the values in an array I provide as a parameter.

That’s very abstract, so imagine the numbers in the array each represent a patient’s favorite food. 1 is pasta, 2 is beef, 3 is pork, 4 is chicken, 5 is eggplant, 6 is cauliflower, and so on. An Attributes value of [1, 4, 6] means a patient likes pasta, chicken, and cauliflower.

Now I need to issue a query to see what patients think of a meal that combines pasta, chicken, and eggplant (a [1, 4, 5]).

Cosmos provides a number of aggregation and array operators, including an ARRAY_CONTAINS, but to make multiple queries with dynamic parameter values, I thought a user-defined function might be easier.

In Cosmos, we implement UDFs as JavaScript functions. Here’s a UDF that takes two arrays and counts the number of items in the arrays that intersect, or match, so intersectionCount([2,5], [1,2,3,5,7,10]) returns 2.

function intersectionCount(array1, array2) {
    var count = array1.reduce((accumulator, value) => {
        if (array2.indexOf(value) > -1) {
            return accumulator + 1;
        }
        return accumulator;
    }, 0);
    return count;
}

One way to use the UDF is to query the collection and return the count of matches with each document.

SELECT p, udf.intersectionCount([4,5], p.Attributes) 
FROM Patients p

I can also use the UDF in a WHERE clause.

SELECT *
FROM Patients p
WHERE udf.intersectionCount([1,3], p.Attributes) > 1

The UDF makes the queries easy, but might not be the best approach for performance. You’ll need to evaluate the impact of this approach using your own data and application behavior.


Comments
Gravatar Nican Thursday, August 30, 2018
What is the RU cost of having the udf in the WHERE clause? If there are 1,000s of documents in your database, and udf.intersectionCount always returns 0, are you going to be paying the overhead of scanning every document?
Gravatar scott Friday, August 31, 2018
@Nican: Yes, it can be expensive. The alternative is to re-design (or build a derivative) collection that is optimized for this sort of query.
Gravatar Alan (snowcode) Sunday, September 2, 2018
Good question @Nican,... Surely the only useful discussions around CosmosDB are discussions around how to lower RU. Nothing, absolutely nothing else matters? We absolutely have to bake "efficiency" in all our thinking and make it a first class citizen. Long gone are the days that code can be changed and a quick glance from a second person doing a "code review", is good enough. We have to bake performance (impact) measuring into all our cloud code. I'd be very interested in a discussion on how can we bake measuring RU into our build scripts and have gated check-ins to stop code from increasing RU for a component above a given threashold, or, being able to set the maximum budgted RU for a series of components or sequence of business processes.
Gravatar Alan (snowcode) Monday, September 3, 2018
Hi @Scott, ... just re-read the comment I wrote yesterday, and realise that I forgot to ask or pose a few important questions, specifically "how do you know that your approach is efficient?", How did you defined "efficient"? How does your approach compare to returning all the data and performing the udf client side? Also, was the extra data that would be returned a factor that impacted your definition for "efficient"? (i.e. doing it client side, since the resulting recordset would be 1 record per row entry, you'd be trading off cpu [RU] for network traffic) And lastly, what were the alternatives that a novice CosmosDB user might immediately think of, and why would that be "bad" (less efficient), or more costly RU? I'm also not sure if the bulk of RU is the actual processing, or whether holding a buffer of data while it's streaming counts towards that RU cost? Since the docs claims that RU's are deterministic, I'm doubtful that bandwidth is a part of that calculation. ( In case anyone is reading this discussion wondering what is this RU we're mentioning, here's a link explaining : https://docs.microsoft.com/en-us/azure/cosmos-db/request-units )
Gravatar scott Tuesday, September 4, 2018
@Alan: Honestly, I think it comes down to ROI. If I am spending $1 to make $100 for every user of the system, that could be efficient enough (compared to say, $1 to make $2). Line of business apps are a more difficult calculation since the app isn't making direct revenue, but the more memory you use and the more calculations you need in a query, the higher the RU cost and the more you'll pay per month. If this solution was used in a collection with millions of documents or where the arrays are large (> 20 elements), I'd redesign the collection, or make a derivative.
Comments are closed.