I’ve been asked a few times about how to optimize LINQ code. The first step in optimizing LINQ code is to take some measurements and make sure you really have a problem.
It turns out that optimizing LINQ code isn’t that different from optimizing regular C# code. You need to form a hypothesis, make changes, and measure, measure, measure every step of the way. Measurement is important, because sometimes the changes you need to make are not intuitive.
Here is a specific example using LINQ to Objects.
Let’s say we have 100,000 of these in memory:
public class CensusRecord
{
public string District{ get; set; }
public long Males { get; set; }
public long Females { get; set; }
}
We need a query that will give us back a list of districts ordered by their male / female population ratio, and include the ratio in the query result. A first attempt might look like this:
var query =
from r in _censusRecords
orderby (double)r.Males / (double)r.Females descending
select new
{
District = r.District,
Ratio = (double)r.Males / (double)r.Females
};
query = query.ToList();
It’s tempting to look at the query and think - “If we only calculate the ratio once, we can make the query faster and more readable! A win-win!”. We do this by introducing a new range variable with the let clause:
var query =
from r in _censusRecords
let ratio = (double)r.Males / (double)r.Females orderby ratio descending
select new
{
District = r.District,
Ratio = ratio
};
query = query.ToList();
If you measure the execution time of each query on 100,000 objects, however, you’ll find the second query is about 14% slower than the first query, despite the fact that we are only calculating the ratio once. Surprising! See why we need to take measurements?
The key to this specific issue is understanding how the C# compiler introduces the range variable ratio into the query processing. We know that C# translates declarative queries into a series of method calls. Imagine the method calls forming a pipeline for pumping objects. The first query we wrote would translate into the following:
var query =
_censusRecords.OrderByDescending(r => (double)r.Males /
(double)r.Females)
.Select(r => new { District = r.District,
Ratio = (double)r.Males /
(double)r.Females });
The second query, the one with the let clause, is asking LINQ to pass an additional piece of state through the object pipeline. In other words, we need to pump both a CensusRecord object and a double value (the ratio) into the OrderByDescending and Select methods. There is no magic involved - the only way to get both pieces of data through the pipeline is to instantiate a new object that will carry both pieces of data. When C# is done translating the second query, the result looks like this:
var query =
_censusRecords.Select(r => new { Record = r,
Ratio = (double)r.Males /
(double)r.Females })
.OrderByDescending(r => r.Ratio)
.Select(r => new { District = r.Record.District,
Ratio = r.Ratio });
The above query requires two projections, which is 200,000 object instantiations. CLR Profiler says the let version of the query uses 60% more memory.
Now we have a better idea why performance decreased, and we can try a different optimization. We’ll write the query using method calls instead of a declarative syntax, and do a projection into the type we need first, and then order the objects.
var query =
_censusRecords.Select(r => new { District = r.District,
Ratio = (double)r.Males /
(double)r.Females })
.OrderByDescending(r => r.Ratio);
This query will perform about 6% faster than the first query in the post, but consistently (and mysteriously) uses 5% more memory. Ah, tradeoffs.
The moral of the story is not to rewrite all your LINQ queries to save a 5 milliseconds here and there. The first priority is always to build working, maintainable software. The moral of the story is that LINQ, like any technology, requires analysis and measurements to make optimization gains because the path to better performance isn’t always obvious. Also remember that a query “optimized” for LINQ to Objects might make things worse when the same query uses a different provider, like LINQ to SQL.