A Simple MapReduce with MongoDB and C#

Monday, March 19, 2012

If you work with relational databases and someone says "data aggregation", you immediately think of a GROUP BY clause and the standard aggregation operators, like COUNT, MIN, and MAX.

MapReduce with MongoDB is also a form of data aggregation where you can take a large amount of information and aggregate (reduce) it to some smaller amount of information. Before reducing, you have the ability to translate (map) the information into a structure designed for the custom reduction process. For more details, see Karl Seguin's fabulous work titled The Little MongoDB Book.

As an example of how to use MapReduce from C#, let's use Movie objects with Title, Category, and Minutes (length) properties.

void AddMovies(MongoCollection<Movie> collection)
{
    var movies = new List<Movie>
    {
        new Movie { Title="The Perfect Developer", 
                    Category="SciFi", Minutes=118 },
        new Movie { Title="Lost In Frankfurt am Main", 
                    Category="Horror", Minutes=122 }, 
        new Movie { Title="The Infinite Standup", 
                    Category="Horror", Minutes=341 } 
    };
    collection.InsertBatch(movies);
}

Let's say we want to find the total number of movies in each category, along with the total length and average length per category. With MongoDB we can do this with a MapReduce operation, and MapReduce requires JavaScript.

The Map

When you tell Mongo to MapReduce, the function you provide as the map function will receive each Movie as the this parameter. The purpose of the map is to exercise whatever logic you need in JavaScript and then call emit 0 or more times to produce a reducible value.

For now we'll leave the JavaScript embedded in the C# code as a string, but we'll look at something nicer next week.

string map = @"
    function() {
        var movie = this;
        emit(movie.Category, { count: 1, totalMinutes: movie.Minutes });
    }";

For each movie we'll emit a key and a value. The key is the first parameter to the emit function and represents how we want to group the values (in this case we are grouping by category). The second parameter to emit is the value, which in this case is a little object containing the count of movies (always 1) and the length of that individual movie.
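To see what the map function emits without a running server, here is a minimal sketch that runs it against plain objects in Node.js. The emit function and the emitted array below are hypothetical stand-ins written for this illustration; in a real MapReduce the server supplies emit and binds each document to this.

```javascript
// Stand-in for MongoDB's emit(): records each key/value pair in memory.
// This is an assumption for illustration; the server provides the real emit().
const emitted = [];
function emit(key, value) {
    emitted.push({ key: key, value: value });
}

// Same map function as in the article.
function map() {
    var movie = this;
    emit(movie.Category, { count: 1, totalMinutes: movie.Minutes });
}

// The three movies inserted by AddMovies.
const movies = [
    { Title: "The Perfect Developer", Category: "SciFi", Minutes: 118 },
    { Title: "Lost In Frankfurt am Main", Category: "Horror", Minutes: 122 },
    { Title: "The Infinite Standup", Category: "Horror", Minutes: 341 }
];

// call() mimics the server binding each document to `this`.
movies.forEach(function (movie) { map.call(movie); });

emitted.forEach(function (pair) { console.log(JSON.stringify(pair)); });
```

Each movie produces exactly one pair here, but nothing stops a map function from calling emit several times per document, or not at all.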

The Reduce

Mongo will group the items you emit and pass them as an array to the reduce function you provide. It's inside the reduce function where you want to do the aggregation calculations and reduce all the objects to a single object. We are using simple logic here, but you can make extremely complex map and reduce functions using all the power of JavaScript.

string reduce = @"        
    function(key, values) {
        var result = {count: 0, totalMinutes: 0 };

        values.forEach(function(value){               
            result.count += value.count;
            result.totalMinutes += value.totalMinutes;
        });

        return result;
    }";

The reduce function returns a single result. It's important for the return value to have the same shape as the emitted values, because MongoDB can call the reduce function multiple times for a given key, passing a partial set of values that may include the results of earlier reduce calls. If you need to perform some final calculation on the fully reduced value, you can also give MapReduce a finalize function.
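The same-shape requirement can be checked locally. The sketch below reduces a group in one call and then in two steps, feeding the partial result back in as if it were an emitted value. The third Horror value (97 minutes) is made up for this illustration and is not part of the article's data.

```javascript
// Same reduce function as in the article.
function reduce(key, values) {
    var result = { count: 0, totalMinutes: 0 };
    values.forEach(function (value) {
        result.count += value.count;
        result.totalMinutes += value.totalMinutes;
    });
    return result;
}

const horror = [
    { count: 1, totalMinutes: 122 },
    { count: 1, totalMinutes: 341 },
    { count: 1, totalMinutes: 97 }   // hypothetical extra movie, for illustration only
];

// One call over the full set of values.
const allAtOnce = reduce("Horror", horror);

// Two calls: reduce a partial set first, then reduce that result with the rest.
const partial = reduce("Horror", horror.slice(0, 2));
const inTwoSteps = reduce("Horror", [partial, horror[2]]);

console.log(JSON.stringify(allAtOnce));
console.log(JSON.stringify(inTwoSteps));
```

Both calls give the same totals because the reduce output looks exactly like an emitted value. Had reduce returned an average instead, the second step would have combined it incorrectly.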

The Finalize

The finalize function is optional, but if you need to calculate something based on a fully reduced set of data, you'll want to use a finalize function. Mongo will call the finalize function after all the reduce calls for a set are complete. This would be the place to calculate the average length of all movies in a category.

string finalize = @"
    function(key, value){
      
      value.average = value.totalMinutes / value.count;
      return value;

    }";
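A slightly more defensive variant of the finalize function is sketched below. In JavaScript, dividing zero by zero yields NaN rather than throwing, so the guard only matters if a group could ever arrive with a count of 0. Falling back to 0 for the average is an assumption made for this sketch, not something the article prescribes.

```javascript
// Defensive finalize: avoids storing NaN when count is 0.
// The fallback value of 0 is an arbitrary choice for this illustration.
function finalize(key, value) {
    value.average = value.count > 0
        ? value.totalMinutes / value.count
        : 0;
    return value;
}

console.log(JSON.stringify(finalize("Horror", { count: 2, totalMinutes: 463 })));
console.log(JSON.stringify(finalize("Empty", { count: 0, totalMinutes: 0 })));
```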

Putting It Together

With the JavaScript in place, all that is left is to tell MongoDB to execute a MapReduce.

var collection = db.GetCollection("movies");
var options = new MapReduceOptionsBuilder();
options.SetFinalize(finalize);
options.SetOutput(MapReduceOutput.Inline);
var results = collection.MapReduce(map, reduce, options);

foreach (var result in results.GetResults())
{
    Console.WriteLine(result.ToJson());
}

Which would produce:

{ "_id" : "Horror",
  "value" : { "count" : 2.0, "totalMinutes" : 463.0, "average" : 231.5 }
}
{ "_id" : "SciFi",
  "value" : { "count" : 1.0, "totalMinutes" : 118.0, "average" : 118.0 }
}
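These numbers can be checked without a server by running the three functions through a small in-memory simulation in Node.js. The groups map, the emit stand-in, and the sorting step are hypothetical scaffolding for this sketch and say nothing about how MongoDB executes the job internally.

```javascript
// In-memory stand-in for the grouping step: key -> array of emitted values.
const groups = new Map();
function emit(key, value) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
}

// The article's map, reduce, and finalize functions.
function map() {
    var movie = this;
    emit(movie.Category, { count: 1, totalMinutes: movie.Minutes });
}
function reduce(key, values) {
    var result = { count: 0, totalMinutes: 0 };
    values.forEach(function (value) {
        result.count += value.count;
        result.totalMinutes += value.totalMinutes;
    });
    return result;
}
function finalize(key, value) {
    value.average = value.totalMinutes / value.count;
    return value;
}

const movies = [
    { Title: "The Perfect Developer", Category: "SciFi", Minutes: 118 },
    { Title: "Lost In Frankfurt am Main", Category: "Horror", Minutes: 122 },
    { Title: "The Infinite Standup", Category: "Horror", Minutes: 341 }
];

// Map phase: bind each document to `this`, as the server would.
movies.forEach(function (movie) { map.call(movie); });

// Reduce and finalize each group. This sketch always calls reduce, which is
// safe here because the reduced value has the same shape as an emitted value.
const results = [];
groups.forEach(function (values, key) {
    results.push({ _id: key, value: finalize(key, reduce(key, values)) });
});
results.sort(function (a, b) { return a._id.localeCompare(b._id); });

results.forEach(function (r) { console.log(JSON.stringify(r)); });
```

The simulation produces a count of 2, 463 total minutes, and an average of 231.5 for Horror, and a count of 1 with 118 minutes for SciFi, matching the output above.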

 

Note that you can use GetResultsAs<T> to map the results into .NET objects of type T. You can also have MapReduce store (or merge) the computed results into a collection instead of returning inline results as we have done in the example. Creating a collection from a MapReduce operation is the ideal strategy to use when you need the results frequently. The collection will serve as a cache.


Comments
Chris Broome Monday, March 19, 2012
Does the finalize method always get called or only called when there are results? I'm just curious because there's no check for a 'count' of 0, so you could get a divide by 0 exception there - or the equivalent in JS - which may or may not be OK, but my gut tells me that a check should be there.
scott Monday, March 19, 2012
@Chris: It looks like it wouldn't be called, but good catch, it would be good to have some defensive programming there.
Mike Rodda Tuesday, March 20, 2012
Can this be done with multiple collections?

By the looks of things, the MapReduce(..) function works off a Collection object.

For example: Could you pass the Movie and the MovieStore collection into the MapReduce? I assume not, and you would probably do it a different way (multiple MapReduce function called on each collection)?
scott Tuesday, March 20, 2012
@Mike: I haven't tried that yet, but it does look like you would MapReduce both collections and use an output type of "Reduce" to combine the results into a single output. See tebros.com/...
by K. Scott Allen