I’ve been asked a few times about how to optimize LINQ code. The first step in optimizing LINQ code is to take some measurements and make sure you really have a problem.
It turns out that optimizing LINQ code isn’t that different from optimizing regular C# code. You need to form a hypothesis, make changes, and measure, measure, measure every step of the way. Measurement is important, because sometimes the changes you need to make are not intuitive.
Here is a specific example using LINQ to Objects.
Let’s say we have 100,000 of these in memory:
public class CensusRecord
{
    public string District { get; set; }
    public long Males { get; set; }
    public long Females { get; set; }
}
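To measure anything, we first need those 100,000 objects in memory. Here is a minimal sketch for building test data (the district names and population counts are invented, and the field _censusRecords is assumed to be a List<CensusRecord>):

// Build 100,000 fake records for measurement - names and counts are made up.
var random = new Random(42);
_censusRecords = Enumerable.Range(0, 100000)
    .Select(i => new CensusRecord
    {
        District = "District " + i,
        Males = random.Next(1000, 1000000),
        Females = random.Next(1000, 1000000)
    })
    .ToList();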
We need a query that will give us back a list of districts ordered by their male / female population ratio, and include the ratio in the query result. A first attempt might look like this:
var query =
    from r in _censusRecords
    orderby (double)r.Males / (double)r.Females descending
    select new
    {
        District = r.District,
        Ratio = (double)r.Males / (double)r.Females
    };

query = query.ToList();
It’s tempting to look at the query and think, “If we only calculate the ratio once, we can make the query faster and more readable! A win-win!” We do this by introducing a new range variable with the let clause:
var query =
    from r in _censusRecords
    let ratio = (double)r.Males / (double)r.Females
    orderby ratio descending
    select new
    {
        District = r.District,
        Ratio = ratio
    };

query = query.ToList();
If you measure the execution time of each query on 100,000 objects, however, you’ll find the second query is about 14% slower than the first query, despite the fact that we are only calculating the ratio once. Surprising! See why we need to take measurements?
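For reference, one way to time each version is with System.Diagnostics.Stopwatch. This is only a sketch of the approach, not the exact harness used for the numbers above:

// Rough timing sketch - run it a few times and ignore the first (jitted) run.
var stopwatch = Stopwatch.StartNew();
var results = query.ToList();
stopwatch.Stop();
Console.WriteLine("Elapsed: {0} ms", stopwatch.ElapsedMilliseconds);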
The key to this specific issue is understanding how the C# compiler introduces the range variable ratio into the query processing. We know that C# translates declarative queries into a series of method calls. Imagine the method calls forming a pipeline for pumping objects. The first query we wrote would translate into the following:
var query =
    _censusRecords.OrderByDescending(r => (double)r.Males / (double)r.Females)
                  .Select(r => new
                  {
                      District = r.District,
                      Ratio = (double)r.Males / (double)r.Females
                  });
The second query, the one with the let clause, is asking LINQ to pass an additional piece of state through the object pipeline. In other words, we need to pump both a CensusRecord object and a double value (the ratio) into the OrderByDescending and Select methods. There is no magic involved - the only way to get both pieces of data through the pipeline is to instantiate a new object that will carry both pieces of data. When C# is done translating the second query, the result looks like this:
var query =
    _censusRecords.Select(r => new
                  {
                      Record = r,
                      Ratio = (double)r.Males / (double)r.Females
                  })
                  .OrderByDescending(r => r.Ratio)
                  .Select(r => new
                  {
                      District = r.Record.District,
                      Ratio = r.Ratio
                  });
The above query requires two projections, which means 200,000 object instantiations for our 100,000 records. CLR Profiler says the let version of the query uses 60% more memory.
Now we have a better idea why performance decreased, and we can try a different optimization. We’ll write the query using method calls instead of a declarative syntax, and do a projection into the type we need first, and then order the objects.
var query =
    _censusRecords.Select(r => new
                  {
                      District = r.District,
                      Ratio = (double)r.Males / (double)r.Females
                  })
                  .OrderByDescending(r => r.Ratio);
This query will perform about 6% faster than the first query in the post, but consistently (and mysteriously) uses 5% more memory. Ah, tradeoffs.
The moral of the story is not to rewrite all your LINQ queries to save 5 milliseconds here and there. The first priority is always to build working, maintainable software. The moral is that LINQ, like any technology, requires analysis and measurement to make optimization gains, because the path to better performance isn’t always obvious. Also remember that a query “optimized” for LINQ to Objects might make things worse when the same query uses a different provider, like LINQ to SQL.
I recently had some time on airplanes to read through Bitter EJB, POJOs in Action, and Better, Faster, Lighter Java. All three books were good, but the last one was my favorite, and was recommended to me by Ian Cooper. No, I’m not planning on trading in assemblies for jar files just yet. I read the books to get some insight and perspectives into specific trends in the Java ecosystem.
It’s impossible to summarize the books in one paragraph, but I’ll try anyway:
Some Java developers shun the EJB framework so they can focus on objects. Simple objects. Testable objects. Malleable objects. Plain old Java objects that solve business problems without being encumbered by infrastructure and technology concerns.
That’s the gist of the three books in 35 words. The books also talk about patterns, anti-patterns, domain driven design, lightweight frameworks, processes, and generally how to write software. You’d be surprised how much content is applicable to .NET. In fact, when reading through the books I began to think of .NET and Java as two parallel universes whose deviations could be explained by the accidental killing of one butterfly during a time traveling safari.
The focus of this post is one particular deviation that really stood out.
The Java developers who focus on objects eventually have to deal with other concerns like persistence. Their object focus naturally leads some of them to try object-relational mapping frameworks. ORMs like Hibernate not only provide these developers with productivity gains, but do so in a relatively transparent and non-intrusive manner. The two work well together right from the start as the developers understand the ORMs, and the ORMs seem to understand the developers.
.NET includes DataSets, DataTables, and DataViews. There is an IDE with a Data menu, and a toolbox with a Data tab full of data controls and DataSources. It’s easy to stereotype mainstream .NET development as data-centric. When you introduce an ORM to a .NET developer who has never seen one, the typical questions are along the lines of:
How do I manage my identity values after an INSERT?
... and ...
Does this thing work with stored procedures?
Perfectly reasonable questions given the data-centric atmosphere of .NET, but you can almost feel the tension in them. And that is the deviation that stood out to me. On the airplane, I read about Java developers who focused on objects and went in search of ORMs. In .NET land, I’m seeing the ORMs going in search of the developer who is focused on data. The ORMs in particular are LINQ to SQL (currently shipping in Visual Studio) and the Entity Framework (shipping in .NET 3.5 SP1). Anyone expecting something like “ADO.NET 3.5” is in for a surprise. Persistent entities and DataSets are two different creatures, and require two different mind sets.
Can a data-centric developer make the shift to an object-first mind set? It’s possible, but the tools make it difficult. The Entity Framework, for instance, presents developers with cognitive dissonance at several points. The documentation will tell you the goal of EF is to create a rich, conceptual object model, but the press releases proclaim that the Entity Framework simplifies data-centric development. There will not be any plain old CLR objects (POCOs) in EF, and the object-focused implicit lazy loading that comes standard in most ORMs isn’t available (you can read any property on this entity, um, except that one – you’ll have to load it first).
LINQ to SQL is different. LINQ to SQL is objects all the way down. You can use plain old CLR objects with LINQ to SQL if you dig beyond the surface. However, the surface is a shiny designer that looks just like the typed DataSet designer. LINQ to SQL also needs some additional mapping flexibility to truly separate the object model from the underlying database schema – hopefully we’ll see this in the next version.
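As a sketch of what digging beyond the surface looks like, you can skip the designer and map a plain class by hand with the attributes in System.Data.Linq.Mapping. The table and column names here are invented for illustration:

// A hand-mapped POCO - no designer, no generated base class.
[Table(Name = "Districts")]
public class District
{
    [Column(IsPrimaryKey = true)]
    public int Id { get; set; }

    [Column]
    public string Name { get; set; }

    [Column]
    public long Males { get; set; }

    [Column]
    public long Females { get; set; }
}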
If you are a .NET developer who is starting to use an ORM, any ORM, you owe it to yourself and your project to reset your defaults and think differently about the new paradigm. Forget what you know about DataSets and learn about the unit of work pattern. Forget what you know about data readers and learn how an ORM identity map works. Think objects first, data second. If you can’t think of data second, an ORM might not be the technology for you.
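To make the unit of work and identity map ideas concrete, here is a sketch using a LINQ to SQL DataContext with a hypothetical District entity (like the attribute-mapped class above); the connection string is assumed to exist:

using (var db = new DataContext(connectionString))
{
    var districts = db.GetTable<District>();

    // Identity map: asking for the same row twice yields the same object instance.
    var first = districts.Single(d => d.Id == 1);
    var second = districts.Single(d => d.Id == 1);
    Console.WriteLine(ReferenceEquals(first, second)); // True

    // Unit of work: changes are tracked and flushed together on SubmitChanges.
    first.Name = "Renamed district";
    db.SubmitChanges();
}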
Matt Podwysocki invited me to speak at the D.C. alt.net meeting next Thursday evening (July 24th). The topic is LINQ. Matt specifically requested a code-heavy presentation, so expect two slides followed by plenty of hot lambda and Expression<T> action.
Hopefully, Matt doesn’t blackout the neighborhood like he did at the nearby RockNUG meeting this week. The White House is two blocks away and the people inside get a little jumpy about blackouts.
DateTime:
7/24/2008 - 7PM-9PM
Location:
Cynergy Systems Inc.
1600 K St NW
Suite 300
Washington, DC 20006
In the BI space I’ve seen a lot of SQL queries succumb to complexity. A data extraction query adds some joins, then some filters, then some nested SELECT statements, and it becomes an unhealthy mess in short order. It’s unfortunate, but standard SQL just isn’t a language geared for refactoring towards simplification (although UDFs and CTEs in T-SQL have helped).
I’ve really enjoyed writing LINQ queries this year, and I’ve found them easy to keep pretty.
For example, suppose you need to parse some values out of the following XML:
<ROOT>
<data>
<record>
<field name="Country">Afghanistan</field>
<field name="Year">1993</field>
<field name="Value">16870000</field>
<!-- ... -->
</record>
<!-- ... -->
</data>
</ROOT>
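Assume the document is already loaded into an XDocument with LINQ to XML (the file name below is just a placeholder):

// Load the document; "data.xml" stands in for wherever the XML lives.
XDocument doc = XDocument.Load("data.xml");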
A first crack might look like the following:
var entries =
    from r in doc.Descendants("record")
    select new
    {
        Country = r.Elements("field")
                   .Where(f => f.Attribute("name").Value == "Country")
                   .First().Value,
        Year = r.Elements("field")
                .Where(f => f.Attribute("name").Value == "Year")
                .First().Value,
        Value = double.Parse(r.Elements("field")
                              .Where(f => f.Attribute("name").Value == "Value")
                              .First().Value)
    };
The above is just a mass of method calls and string literals. But, add in a quick helper or extension method…
public static class XElementExtensions
{
    public static XElement Field(this XElement element, string name)
    {
        return element.Elements("field")
                      .Where(f => f.Attribute("name").Value == name)
                      .First();
    }
}
… and you can quickly turn the query around into something readable.
var entries =
    from r in doc.Descendants("record")
    select new
    {
        Country = r.Field("Country").Value,
        Year = r.Field("Year").Value,
        Value = double.Parse(r.Field("Value").Value)
    };
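One more tweak worth considering (my own addition, not part of the query above): if a record might be missing a field, a defensive variant of the helper returns null instead of throwing. Callers would then check for null before reading Value.

// A hypothetical defensive variant - returns null when the field isn't present.
public static XElement FieldOrDefault(this XElement element, string name)
{
    return element.Elements("field")
                  .Where(f => (string)f.Attribute("name") == name)
                  .FirstOrDefault();
}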
If only SQL code were as easy to break apart!
Haiku is a popular poetic form that has evolved over centuries. Restku is Haiku with a twist.
crystal pixels
get brighter
an abundance of excitement
The twist is that the author of a Restku is restricted to using a single verb from this list: get, post, put, and delete. Although traditional Restku insists on present tense usage of the four verbs, adventurous authors will mix in past tense, future tense, and on occasion, present perfect tense.
unexpected dialog
a “progress” bar
vista has posted the bad news
Although Restku was inspired by REST, a software architecture style, there is no reason an author can’t frame concepts from outside the world of information technology into a Restku.
weathered glove
humid skies
put on a childhood dream
Relax your mind with the mental stimulation of writing a Restku today, for tomorrow is still a mystery.
four hundred and four
electrical neurons
delete her memory
Herding Code is a podcast about a variety of topics in technology and software development. It’s done roundtable style with me, Scott Koon, Kevin Dente, and Jon Galloway. The conversations are a blast, and I hope they’re informative, too.
Tune in to the feed here: http://feeds.feedburner.com/HerdingCode
Salmon swim upstream, and look at what happens …
Every developer is familiar with the “work around”. These are the extra bits of code we write to overcome limitations in an API, platform, or framework.
But, sometimes those limitations are a feature. The designer of a framework might be guiding you in a specific direction. Take the Silverlight networking APIs as an example. The APIs provide only asynchronous communication options, yet I’ve seen a few people try to block on network operations with code like the following:
AutoResetEvent _event = new AutoResetEvent(false);
WebClient client = new WebClient();
client.DownloadStringCompleted +=
(s, ev) => { _message.Text = ev.Result; _event.Set(); };
client.DownloadStringAsync(new Uri("foo.xml", UriKind.Relative));
_event.WaitOne();
This code results in a deadlock, since the WebClient tries to raise the completed event on the main thread, but the main thread is blocked inside WaitOne and waiting for the completed event to fire. This deadlock is not only fatal to the Silverlight application, but can bring down the web browser, too. Even if this code didn't create a deadlock, do you really want your application to block over a slow network connection?
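For contrast, here is a sketch of the intended, non-blocking use of the same API: let the completed event handler update the UI when the download finishes. The "foo.xml" URI and the _message element match the sample above.

WebClient client = new WebClient();
client.DownloadStringCompleted += (s, ev) =>
{
    // Runs back on the UI thread when the download finishes.
    if (ev.Error == null)
    {
        _message.Text = ev.Result;
    }
};
client.DownloadStringAsync(new Uri("foo.xml", UriKind.Relative));
// No WaitOne - the UI thread stays free and the callback does the work.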
When you find yourself writing “work around” code, it’s worthwhile to review the situation. Are you really working around a limitation? Or are you working against the intended use of a framework? Working against the framework is rarely a good idea – there can be a lot of hungry bears waiting to catch you in the future.