Lazy LINQ and Enumerable Objects

Someone asked me why LINQ operators return an IEnumerable<T> instead of something more useful, like a List<T>. In other words, in the following code:

List<Book> books = new List<Book>();
// ...
IEnumerable<Book> filteredBooks = 
    books.Where(book => book.Title.StartsWith("R"));

... we started with a List<Book>, so why isn’t the Where operator smart enough to return a new List<Book>, or modify the existing list by removing books that don’t match the Where condition?

Let’s talk about modifying the original list.

I hope you’ll agree that it would be odd for a query to modify a data source. Imagine sending a SELECT statement to a database and finding out later your SELECT removed all but one record from a table. Although the Where operator is just a method call that could change the underlying list of books, it’s better to return something new and leave the original list intact. You won’t find any LINQ operators that modify their input, and this behavior produces many benefits. One obvious benefit is that you can think of the above code as a query, and it won’t surprise you by removing books from your original list.

What about creating a new List<T>?

It turns out that creating a new list can be quite expensive, but only because we often don’t need a List<T> returned. Think about the number of lists created in the following query (if each operator created a new list).

var filteredBooks =
        books.Where(book => book.Title.StartsWith("R"))
             .OrderBy(book => book.Published.Year)
             .Select(book => book.Title);

Here we would have three lists created (one each by the Where, OrderBy, and Select operators). We only needed a single list of titles as the result, but since the operators cannot modify their underlying data source (for the reasons outlined above), each would be forced to create a new list, and most of that work would be wasted as two of the lists are immediately discarded.

Imagine if you only needed to count the number of books whose title starts with the letter R – in that case you wouldn’t want any of these lists being created and destroyed when you computed the result. It would all be wasted work.

Laziness Isn’t Always Bad

One of the principles of LINQ is to be lazy. A LINQ query won’t do any work unless you force the query to do the work. Even when a query does perform work – it does the least amount of work possible. If you wanted a List<Book> as a result, you’d have to force LINQ to create the list:

List<Book> filteredBooks =
               books.Where(book => book.Title.StartsWith("R"))
                    .ToList();

 

Instead of lists, LINQ works with a beautifully pure abstraction called IEnumerable<T>. It’s defined like so:

public interface IEnumerable<T> : IEnumerable
{
    IEnumerator<T> GetEnumerator();
}

The only thing you can do with an IEnumerable<T> is ask for an enumerator. An enumerator is something that knows how to visit each item in a collection. Some languages call these enumerator things “iterators”, because they iterate over a collection of objects, returning each object and moving to the next.
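
To make the enumerator idea concrete, here is a minimal sketch of walking a sequence by hand, which is roughly what a foreach loop does for you. The class name EnumeratorDemo, the Walk method, and the sample titles are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class EnumeratorDemo
{
    // Walks a sequence by hand, roughly the way a foreach loop does behind the scenes.
    public static List<string> Walk(IEnumerable<string> sequence)
    {
        List<string> visited = new List<string>();
        using (IEnumerator<string> enumerator = sequence.GetEnumerator())
        {
            // MoveNext advances to the next item and returns false at the end.
            while (enumerator.MoveNext())
            {
                visited.Add(enumerator.Current);
            }
        }
        return visited;
    }

    public static void Main()
    {
        // Hypothetical sample data; any IEnumerable<string> would do.
        string[] titles = { "Rayuela", "Dune", "Rebecca" };
        foreach (string title in Walk(titles))
        {
            Console.WriteLine(title);
        }
    }
}
```

Notice that Walk never asks what kind of collection it was given. All it needs is GetEnumerator, MoveNext, and Current.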

The in-memory LINQ operators, like Where, OrderBy, and Select, all work on inputs that implement IEnumerable<T>. That means the operators work on anything that can be enumerated over in a one-by-one fashion. The beauty is that there are so many data sources that these operators can work on, because IEnumerable<T> has such simple demands. Arrays are enumerable, lists are enumerable, dictionaries, trees, stacks, queues, files in a directory, elements in an XML document - all enumerable. Even a simple string is enumerable, since it is composed from a sequence of individual characters.

These same operators return IEnumerable<T>, because it’s the lowest common denominator for everything you’d ever need from a query. Plus, it’s lazy. You want to count the results? The Count operator will get the enumerator and move through the results, one by one, to sum the total number of items. You want to make a concrete list from the results? The ToList operator will get the enumerator and move through the results, one by one, adding each item to a new list it creates. Do you want just the first item in the results? Then the enumerator does just a little bit of work to find that first item. In most cases it does not need to iterate the entire collection to find the first item. Enumerators are lazy, too.

The important point is that the enumerator itself doesn’t perform any useful work. It’s you, or the other LINQ operators, that use the enumerator to iterate through the result and produce something meaningful. In the odd case that you never need to look at the result - no enumeration work is performed at all. No lists are created. Pure laziness!
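
One way to see the laziness for yourself is to count how often the filter actually runs. The sketch below is only an illustration: the PredicateCalls counter, the BuildQuery helper, and the sample titles are assumptions, not part of LINQ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazinessDemo
{
    // Hypothetical instrumentation: counts how many times the filter runs.
    public static int PredicateCalls;

    public static IEnumerable<string> BuildQuery(IEnumerable<string> titles)
    {
        return titles.Where(title =>
        {
            PredicateCalls++;
            return title.StartsWith("R");
        });
    }

    public static void Main()
    {
        List<string> titles = new List<string> { "Dune", "Rebecca", "Emma", "Rayuela" };

        // Defining the query does not run the filter.
        IEnumerable<string> query = BuildQuery(titles);
        Console.WriteLine(PredicateCalls);               // 0

        // First stops as soon as it finds a match.
        string first = query.First();
        Console.WriteLine(first + " " + PredicateCalls); // Rebecca 2

        // ToList enumerates the whole source again from the start.
        List<string> all = query.ToList();
        Console.WriteLine(all.Count + " " + PredicateCalls); // 2 6
    }
}
```

The filter runs zero times until something enumerates the query, twice for First, and four more times when ToList walks the entire list.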

Summary

The beauty of IEnumerable<T> is that it only says “you can get something to enumerate this”. To return something that offers the possibility of enumeration is very little work. And no work is needed unless you actually count the results, create a list from the results, or bind the results to a control for display.

The interface IEnumerable<T> is so wonderfully lazy it inspired me to write a short, short story. If you came to read about LINQ, skip the story as the words are entirely uninteresting and mostly devoid of meaning. 

The Lazy Leopard and the List

The scientist approached the big cat with a notepad and a pencil in her hands. She was worried, of course. The cat was a predator, and likely to be hungry at this hour of the day. “I need to know”, she asked the cat, “do you keep a list of things to do each day?”

The cat stirred. He was a snow leopard with dark rosettes blotted onto his thick, cream colored fur. The big cat’s eyes were only half open, but he turned and focused them on her.

“I don’t keep a list, dear lady”, he said, followed by a rumbling yawn. “I keep an enumeration”.

“An enumeration?”, she asked.

“Yes, an enumeration”, he replied. “Lists are like the gold bracelet on your wrist, dear lady. Very tangible – very concrete things, lists are. Keeping a list of everything I might want to do is a burden and chore. I’d need to carry your paper, and your pencil”, he said, with his eyes focusing on her hands.

The scientist’s pencil raced across her notebook as she transcribed every word the leopard spoke. She glanced at him as he began to stare, and instinctively pulled herself a little further away.

The leopard continued. “With a list I’d have to add things, and remove things, and constantly reorder the things I want to do. Too much work”, he said, shaking his head. “Do you know what I can do with an enumeration?”, he asked.

She paused at the leopard’s question, and pushed her hair back  - she wore glasses when she worked. After some thought, she asked, “Enumerate it?”

“Yes, dear lady”, said the leopard. “I can enumerate it. I enumerate the possibilities one by one, and find the perfect fit for this moment in my life. If I’m thirsty, I’ll find water. If I’m sleepy, I’ll find a place to sleep.” He tilted his head slightly to the right. “If I’m hungry, I’ll find food”, he said.

She finished writing the leopard’s last words and glanced up. Was that a tooth showing? Was he hungry now?

The cat started speaking again.

“One of the wonderful things about enumerations is they theoretically last forever. Lists have a beginning and an end – an Omega for every Alpha. With enumerations, you can keep asking for the next thing, over and over and over again. I ask for them when I’m ready to do something. If I’m tired of doing, they’ll still be there tomorrow. You might say it’s unpredictable behavior, I say I’m just being lazy. Either way, I can’t help it, it’s in my genes”. His soft voice trailed off with a tired tone.

“You intended to live forever?”, she asked. The leopard snarled. Or smiled. She couldn’t quite tell.

“No, dear lady”, he said. “I said the enumeration theoretically lasts forever. One day I’m sure my enumeration will run out of things to give me, or maybe I’ll just be too tired to ask for the next thing, so I’ll sleep forever. I don’t know how it ends. Maybe I should ask you.”

She looked at him again. She felt uneasy now, being here with a leopard. He seemed nice enough, as leopards go, and he certainly gave her interesting topics for research, but he was still a leopard. A carnivore. He was not a beast to be trifled with. She could never let her guard down again.

“I don’t know how it ends either”, she said, and closed her notepad. She tucked her pencil behind her ear, backed away from the cage, and left the leopard alone with his enumeration.

posted by scott with 3 Comments

What’s Wrong With This Code? (#20)

Mike had to model answers. Yes or no answers, date and time answers - all sorts of answers. One catch was that any answer could be “missing” or could be “empty”. Both values had distinct meanings in the domain. An interface definition fell out of the early iterative design work:

public interface IAnswer
{
    bool IsMissing { get; }
    bool IsEmpty { get; }
}

Mike was prepared to implement a DateTimeAnswer class, but first a test:

[TestMethod]
public void Can_Represent_Empty_DateTimeAnswer()
{
    DateTimeAnswer emptyAnswer = new DateTimeAnswer();
    Assert.IsTrue(emptyAnswer.IsEmpty);
}

After a little work, Mike had a class that could pass the test:

public class DateTimeAnswer : IAnswer
{       
    public bool IsEmpty
    {
        get { return Value == _emptyAnswer; }
    }

    public bool IsMissing
    {
        get { return false; } // todo 
    }

    public DateTime Value { get; set; }

    static DateTime _emptyAnswer = DateTime.MinValue;
    static DateTime _missingAnswer = DateTime.MaxValue;
}

After sitting back and looking at the code, Mike realized there were a couple facets of the class he didn’t like:

  • A client of the class needed to know which values of DateTime were used internally to represent empty and missing answers.  
  • The class felt like it should produce immutable objects, and thus the set-able Value property felt wrong.

Mike returned to his test project, and changed his first test to agree with his idea of how the class should work. Mike figured adding a couple well known DateTimeAnswer objects (named Empty and Missing) would get rid of the magic DateTime values in client code.

[TestMethod]
public void Can_Represent_Empty_DateTimeAnswer()
{
    DateTimeAnswer emptyAnswer = DateTimeAnswer.Empty;
    Assert.IsTrue(emptyAnswer.IsEmpty);
}

Feeling pretty confident, Mike returned to his DateTimeAnswer class and added a constructor, changed the Value property to use a protected setter, implemented IsMissing, and published the two well known DateTimeAnswer objects based on his previous code:

public class DateTimeAnswer : IAnswer
{       
    public DateTimeAnswer (DateTime value)
    {
        Value = value;
    }

    public bool IsEmpty
    {
        get { return Value == _emptyAnswer; }
    }

    public bool IsMissing
    {
        get { return Value == _missingAnswer; }
    }

    public DateTime Value { get; protected set; }
    public static DateTimeAnswer Empty = new DateTimeAnswer(_emptyAnswer);
    public static DateTimeAnswer Missing = new DateTimeAnswer(_missingAnswer);
    static DateTime _emptyAnswer = DateTime.MinValue;
    static DateTime _missingAnswer = DateTime.MaxValue;    
}

Mike’s test passed. Mike was so confident about his class he never wrote a test for IsMissing. It was just too easy – what could possibly go wrong? Imagine his surprise when someone else wrote the following test, and it failed!

[TestMethod]
public void Can_Represent_Missing_DateTimeAnswer()
{
    DateTimeAnswer missingAnswer = DateTimeAnswer.Missing;
    Assert.IsTrue(missingAnswer.IsMissing);
}

What went wrong?

posted by scott with 11 Comments

Stupid LINQ Tricks

Over a month ago I did a presentation on LINQ and promised a few people I’d share the code from the session. Better late than never, eh?

We warmed up by building our own filtering operator to use in a query. The operator takes an Expression<Predicate<T>>, which we need to compile before we can invoke the predicate inside.

public static class MyExtensions
{
    public static IEnumerable<T> Where<T>(
                  this IEnumerable<T> sequence,
                  Expression<Predicate<T>> filter)
    {
        // Compile once, up front, rather than once per item.
        Predicate<T> predicate = filter.Compile();
        foreach (T item in sequence)
        {
            if (predicate(item))
            {
                yield return item;
            }
        }
    }
}

The following query uses our custom Where operator:

IEnumerable<Employee> employees = new List<Employee>()
{
    new Employee() { ID= 1, Name="Scott" },
    new Employee() { ID =2, Name="Paul" }
};


Employee scott =
    employees.Where(e => e.Name == "Scott").First();

Of course, if we are just going to compile and invoke the expression there is little advantage to using an Expression<T>, but it generally turns into an “a-ha!” moment when you show someone the difference between an Expression<Predicate<T>> and a plain Predicate<T>. Try it yourself in a debugger.
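
If you don’t have a debugger handy, a small sketch like the following shows the same difference. The ExpressionDemo class and its Describe method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Linq.Expressions;

public static class ExpressionDemo
{
    // An expression tree is data: we can look at its shape without running it.
    public static string Describe(Expression<Predicate<string>> filter)
    {
        return filter.Parameters[0].Name + " => " + filter.Body.NodeType;
    }

    public static void Main()
    {
        // The same lambda text, converted to two very different things.
        Predicate<string> compiled = name => name == "Scott";
        Expression<Predicate<string>> tree = name => name == "Scott";

        Console.WriteLine(compiled("Scott"));      // a delegate can only be invoked: True
        Console.WriteLine(Describe(tree));         // a tree can be inspected: name => Equal
        Console.WriteLine(tree.Compile()("Paul")); // and compiled when we want to run it: False
    }
}
```

The delegate is opaque executable code, while the expression is an object graph describing the code, which is what lets a provider like LINQ to SQL translate a query instead of running it.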

We also wrote a LINQ version of “Hello, World!” that reads text files from a temp directory (a.txt would contain “Hello,”, while b.txt would contain “World!”). A good demonstration of map-filter-reduce with C# 3.0.

var message = Directory.GetFiles(@"c:\temp\")
                       .Where(fname => fname.EndsWith(".txt"))
                       .Select(fname => File.ReadAllText(fname))
                       .Aggregate(
                           new StringBuilder(),
                          (sb, s) => sb.Append(s).Append(" "),
                          sb => sb.ToString()
                       );


Console.WriteLine(message);
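
If you want to try the same filter, map, and reduce shape without touching the file system, here is a rough in-memory sketch. The AggregateDemo class, the Join method, and the sample strings are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

public static class AggregateDemo
{
    public static string Join(string[] parts)
    {
        return parts
            .Where(p => p.Length > 0)                     // filter: skip empty entries
            .Select(p => p.Trim())                        // map: tidy each entry
            .Aggregate(                                   // reduce: fold into one string
                new StringBuilder(),                      // seed accumulator
                (sb, s) => sb.Append(s).Append(" "),      // accumulate each item
                sb => sb.ToString());                     // convert the final accumulator
    }

    public static void Main()
    {
        // Hypothetical stand-ins for the contents of a.txt and b.txt.
        string message = Join(new string[] { "Hello,", "", " World! " });
        Console.WriteLine(message);
    }
}
```

The three-argument overload of Aggregate takes a seed, an accumulator function, and a result selector, so the StringBuilder never leaks out of the query.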

Moving into NDepend territory, we also wrote a query to find the namespaces with the most types (for referenced assemblies only):

var groups = Assembly.GetExecutingAssembly()
         .GetReferencedAssemblies()
         .Select(aname => Assembly.Load(aname))
         .SelectMany(asm => asm.GetExportedTypes())
         .GroupBy(t => t.Namespace)
         .OrderByDescending(g => g.Count())
         .Take(10);

foreach (var group in groups)
{
    Console.WriteLine("{0} {1}", group.Key, group.Count());
    foreach (var type in group)
    {
        Console.WriteLine("\t" + type.Name);
    }
}

And finally, some LINQ to XML code that creates an XML document out of all the executing processes on the machine:

XNamespace ns = "http://odetocode.com/schemas/linqdemo";
XNamespace ext = "http://odetocode.com/schemas/extensions";

XDocument doc =
    new XDocument(
        new XElement(ns + "Processes",
            new XAttribute(XNamespace.Xmlns + "ext", ext),
            from p in Process.GetProcesses()
            select new XElement(ns + "Process",
               new XAttribute("Name", p.ProcessName),
               new XAttribute(ext + "PID", p.Id))));

Followed by a query for the process IDs of any mspaint instances:

var query =
   (from e in doc.Descendants(ns + "Process")
    where (string)e.Attribute("Name") == "mspaint"
    select (string)e.Attribute(ext + "PID"));

More on LINQ to come…

posted by scott with 8 Comments

Visual Studio SP1 and The Metification of REST

Metification – verb

  1. The act of adding metadata to a web service in order to facilitate tooling and discovery.
  2. The act of adding complexity to a web service in order to achieve tight coupling.

Pick one.

Service Pack 1 for Visual Studio 2008 has just arrived with new features, including version 1.0 of ADO.NET Data Services (a.k.a. Astoria). From the description (highlighting is mine):

ADO.NET Data Services … consists of a combination of patterns and libraries that enables any data store to be exposed as a flexible data service, naturally integrating with the Web, that can be consumed by Web clients within a corporate network or across the Internet. ADO.NET Data Services uses URIs to point to pieces of data and simple, well-known formats to represent that data, such as JSON and ATOM/APP. This results in data being exposed to Web clients as a REST-style resource collection, addressable with URIs that agents can interact with using standard HTTP verbs such as GET, POST, or DELETE.

Compared to the traditional SOAP approach, the REST-style is a different model for exposing functionality over a web service. Instead of defining messages and exposing operations that act on those messages, you expose resources and act on the resources using common HTTP verbs. I’ve lately been thinking of SOAP based web services as “verb oriented” (exposing GetOrder and UpdateCustomer), while REST style web services are “noun oriented” (exposing Orders and Customers). Both models have advantages and disadvantages, but I’ve felt that REST partners well with rich Internet applications that need to retrieve a variety of resources using the same filtering and paging parameters. Creating a heap of GetThisByThat operations is tedious.

Nouns and verbs aren’t the only difference between REST and SOAP. One of the primary strengths of REST is its inherent simplicity. The simplicity not only facilitates broad interoperability, but encourages an acceptance of REST from many who feel overwhelmed by the complexities of WS-*. There are no tools required for REST - all you need is the ability to send an HTTP request and read the response. WS-*, on the other hand, is great when you need a digitally signed message including double-secret user credentials routed through an asynchronous and distributed, two-phase commit transaction with an extended buyer protection. Not everyone needs that flexibility, but you still pay the price for the flexibility when using the tooling and the API, and when configuring the service.

Although we could continue talking about differences in REST and SOAP, I wanted to talk about metadata, and Astoria.

Metification

REST proponents, as a rule of thumb, shun metadata – but not all forms of metadata. Metadata in prose or written documentation is fine. Metadata in a self-describing response format is fine. However, metadata for tooling is seen by many as pure evil. Part of the complexity in WS-* is in the quirky and convoluted folds of metadata formats like WSDL and XML Schema. REST has seen some attempts at standardized metadata (WADL, WSDL 2.0, XSD), but still resists all attempts for the most part. 

I like metadata. Maybe I’ve been in the .NET ecosystem for so long that I expect tooling, but I still remember the first time I tried to write a program for the Flickr web service (which is technically just POX). I was shocked when I couldn’t find a WSDL file. Then I was surprised at how easy it was to craft the correct URL for an HTTP request, and shred apart the XML response to find photographs. It was so easy that ... well, it was just too easy. It reminded me of writing data access code from scratch. Data access code is so predictable and repetitive that we have tools, frameworks, and code generators to take care of the job. But those tools, frameworks, and code generators rely on metadata defined by a database schema, so their job is relatively straightforward. REST is a bit different, unless you are working with Astoria on the server and a CLR client.

Let’s say you have some DTOs for employees, orders, and other objects you want to send over the wire. You’ll need to decorate them with enough information for the service to understand the primary key.

[DataServiceKey("ID")]
public class Employee
{
    public int ID { get; set; }
    public string Name { get; set; }
}

[DataServiceKey("ID")]
public class Order
{
    // …
}

Next, define a class with public IQueryable<T> properties for each “entity set” (Employees and Orders). IQueryable<T> is easy to conjure up, and the class below represents a read-only data source with some fake in-memory data. If you need create, update, and delete functionality the class will need to implement IUpdatable, too. Shawn Wildermuth has a three-part blog series about IUpdatable that he wrote when implementing IUpdatable for the NHibernate LINQ project.

public class AcmeData
{
    public IQueryable<Employee> Employees
    {
        get
        {
            return new List<Employee>
            {
                new Employee() /* ... */,
                new Employee() /* ... */,
                new Employee() /* ... */
                // ...
            }.AsQueryable();
        }
    }

    public IQueryable<Order> Orders
    {
        // ...
    }
    // ...
}

Then you need an .svc file…

<%@ ServiceHost Language="C#"
    Factory="System.Data.Services.DataServiceHostFactory, System.Data.Services, Version=3.5.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
    Service="AcmeDataService" %>

… and you’ll also need a code-behind file for the .svc (which is all set up for you when you use the ADO.NET data service template, you just add some configuration):

public class AcmeDataService : DataService<AcmeData>
{
    public static void InitializeService(IDataServiceConfiguration config)
    {
        config.SetEntitySetAccessRule("Employees", EntitySetRights.AllRead);
        // ... more rules
    }
}

At this point you can start testing the service using a web browser and looking at, for example, http://localhost/AcmeDataService.svc/Employees. What is more interesting is looking at http://localhost/AcmeDataService.svc/$metadata, because there you’ll find service metadata, which is where the magic starts.

To consume the service, right-click on a project in Visual Studio and select “Add Service Reference…”. Yes – the same “Add Service Reference” command you might have seen in the hit motion picture “SOAP and WSDL – an XML Love Story”. This feature blurs the lines between REST and WS-*. Enter the root URL to the service and Visual Studio will generate a proxy – but not the type of proxy you receive when using SOAP based web services. This proxy will derive from the DataServiceContext class and you can use it like so:

var employees = new AcmeData(serviceRoot)
    .Employees
    .Where(e => e.Name == "Scott")
    .OrderBy(e => e.Name)
    .Skip(2)
    .Take(2)
    .ToList();

DataServiceContext does a little bit of magic to turn the LINQ query into the following HTTP request. It’s LINQ to REST:

GET /AcmeDataService.svc/Employees()
?$filter=Name%20eq%20'Scott'&$orderby=Name&$skip=2&$top=2 HTTP/1.1
User-Agent: Microsoft ADO.NET Data Services
Accept: application/atom+xml,application/xml

The data service will respond with some XML that the data context uses to create objects that look just like the server side DTOs.

I’m sure some are horrified at this metification of REST, but for scenarios when you need to talk between two CLR appdomains (think ASP.NET and Silverlight), this approach gives you the advantages of thinking about nouns in a RESTful model without writing all the glue code to wire up endpoints and parse XML. Beauty!

posted by scott with 6 Comments

Optimizing LINQ Queries

I’ve been asked a few times about how to optimize LINQ code. The first step in optimizing LINQ code is to take some measurements and make sure you really have a problem. 

It turns out that optimizing LINQ code isn’t that different from optimizing regular C# code. You need to form a hypothesis, make changes, and measure, measure, measure every step of the way. Measurement is important, because sometimes the changes you need to make are not intuitive.

Here is a specific example using LINQ to Objects.

Let’s say we have 100,000 of these in memory:

public class CensusRecord
{
    public string District { get; set; }
    public long Males { get; set; }
    public long Females { get; set; }
}

We need a query that will give us back a list of districts ordered by their male / female population ratio, and include the ratio in the query result. A first attempt might look like this:

var query =
    from r in _censusRecords
    orderby (double)r.Males / (double)r.Females descending
    select new
    {
        District = r.District,
        Ratio = (double)r.Males / (double)r.Females
    };

query = query.ToList();

It’s tempting to look at the query and think - “If we only calculate the ratio once, we can make the query faster and more readable! A win-win!”. We do this by introducing a new range variable with the let clause:

var query =
    from r in _censusRecords
    let ratio = (double)r.Males / (double)r.Females
    orderby ratio descending
    select new
    {
        District = r.District,
        Ratio = ratio
    };

query = query.ToList();

If you measure the execution time of each query on 100,000 objects, however, you’ll find the second query is about 14% slower than the first query, despite the fact that we are only calculating the ratio once. Surprising! See why we need to take measurements?
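
Here is a minimal sketch of how you might take such a measurement with Stopwatch. The QueryTiming class, the MakeRecords and Measure helpers, the random test data, and the number of runs are all assumptions for illustration. Your numbers will differ from mine depending on the runtime and the machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public class CensusRecord
{
    public string District { get; set; }
    public long Males { get; set; }
    public long Females { get; set; }
}

public static class QueryTiming
{
    // Builds made-up records with a fixed seed so every run sees the same data.
    public static List<CensusRecord> MakeRecords(int count)
    {
        Random random = new Random(42);
        List<CensusRecord> records = new List<CensusRecord>(count);
        for (int i = 0; i < count; i++)
        {
            records.Add(new CensusRecord
            {
                District = "District " + i,
                Males = random.Next(1, 100000),
                Females = random.Next(1, 100000)
            });
        }
        return records;
    }

    // Returns the average elapsed time over the given number of runs.
    public static TimeSpan Measure(Action work, int runs)
    {
        work(); // warm-up run so JIT compilation is not counted
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            work();
        }
        watch.Stop();
        return TimeSpan.FromTicks(watch.Elapsed.Ticks / runs);
    }

    public static void Main()
    {
        List<CensusRecord> records = MakeRecords(100000);

        TimeSpan plain = Measure(() =>
            (from r in records
             orderby (double)r.Males / (double)r.Females descending
             select new { r.District, Ratio = (double)r.Males / (double)r.Females })
            .ToList(), 10);

        TimeSpan withLet = Measure(() =>
            (from r in records
             let ratio = (double)r.Males / (double)r.Females
             orderby ratio descending
             select new { r.District, Ratio = ratio })
            .ToList(), 10);

        Console.WriteLine("plain: {0} ms", plain.TotalMilliseconds);
        Console.WriteLine("let:   {0} ms", withLet.TotalMilliseconds);
    }
}
```

The warm-up call and the averaging over several runs are there to reduce noise. A single timed run of each query is rarely trustworthy.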

Look At Time and Space

The key to this specific issue is understanding how the C# compiler introduces the range variable ratio into the query processing. We know that C# translates declarative queries into a series of method calls. Imagine the method calls forming a pipeline for pumping objects. The first query we wrote would translate into the following:

var query =
    _censusRecords.OrderByDescending(r => (double)r.Males /
                                          (double)r.Females)
                  .Select(r => new { District = r.District,
                                     Ratio = (double)r.Males /
                                             (double)r.Females });

The second query, the one with the let clause, is asking LINQ to pass an additional piece of state through the object pipeline. In other words, we need to pump both a CensusRecord object and a double value (the ratio) into the OrderByDescending and Select methods. There is no magic involved - the only way to get both pieces of data through the pipeline is to instantiate a new object that will carry both pieces of data. When C# is done translating the second query, the result looks like this:

var query =
    _censusRecords.Select(r => new { Record = r,
                                     Ratio = (double)r.Males /
                                             (double)r.Females })
                  .OrderByDescending(r => r.Ratio)
                  .Select(r => new { District = r.Record.District,
                                     Ratio = r.Ratio });

The above query requires two projections, which is 200,000 object instantiations.  CLR Profiler says the let version of the query uses 60% more memory.

Now we have a better idea why performance decreased, and we can try a different optimization. We’ll write the query using method calls instead of a declarative syntax, and do a projection into the type we need first, and then order the objects.

var query =
    _censusRecords.Select(r => new { District = r.District,
                                     Ratio = (double)r.Males /
                                             (double)r.Females })
                  .OrderByDescending(r => r.Ratio);

This query will perform about 6% faster than the first query in the post, but consistently (and mysteriously) uses 5% more memory. Ah, tradeoffs.

Moral Of The Story?

The moral of the story is not to rewrite all your LINQ queries to save 5 milliseconds here and there. The first priority is always to build working, maintainable software. The moral of the story is that LINQ, like any technology, requires analysis and measurements to make optimization gains because the path to better performance isn’t always obvious. Also remember that a query “optimized” for LINQ to Objects might make things worse when the same query uses a different provider, like LINQ to SQL.

posted by scott with 8 Comments

Using an ORM? Think Objects!

I recently had some time on airplanes to read through Bitter EJB, POJOs in Action, and Better, Faster, Lighter Java. All three books were good, but the last one was my favorite, and was recommended to me by Ian Cooper. No, I’m not planning on trading in assemblies for jar files just yet. I read the books to get some insight and perspectives into specific trends in the Java ecosystem.

It’s impossible to summarize the books in one paragraph, but I’ll try anyway:

Some Java developers shun the EJB framework so they can focus on objects. Simple objects. Testable objects. Malleable objects. Plain old Java objects that solve business problems without being encumbered by infrastructure and technology concerns.

That’s the gist of the three books in 35 words. The books also talk about patterns, anti-patterns, domain driven design, lightweight frameworks, processes, and generally how to  write software. You’d be surprised how much content is applicable to .NET. In fact, when reading through the books I began to think of .NET and Java as two parallel universes whose deviations could be explained by the accidental killing of one butterfly during a time traveling safari.

The focus of this post is one particular deviation that really stood out.

From Objects To ORMs

The Java developers who focus on objects eventually have to deal with other concerns like persistence. Their  object focus naturally leads some of them to try object-relational mapping frameworks. ORMs like Hibernate not only provide these developers with productivity gains, but do so in a relatively transparent and non-intrusive manner. The two work well together right from the start as the developers understand the ORMs, and the ORMs seem to understand the developers.

From DataSets to ORMs

.NET includes DataSets, DataTables, and DataViews. There is an IDE with a Data menu, and a GUI toolbox with a Data tab full of Data controls and DataSources. It’s easy to stereotype mainstream .NET development as data-centric. When you introduce an ORM to a .NET developer who has never seen one, the typical questions are along the lines of:

How do I manage my identity values after an INSERT?

... and ...

Does this thing work with stored procedures?

Perfectly reasonable questions given the data-centric atmosphere of .NET, but you can almost feel the tension in these questions. And that is the deviation that stood out to me. On the airplane, I read about Java developers who focused on objects and went in search of ORMs. In .NET land, I’m seeing the ORMs going in search of the developer who is focused on data. The ORMs in particular are LINQ to SQL (currently shipping in Visual Studio) and the Entity Framework (shipping in SP1). Anyone expecting something like “ADO.NET 3.5” is in for a surprise. Persistent entities and DataSets are two different creatures, and require two different mind sets.

Will .NET Developers Focus On Objects Now?

It’s possible, but the tools make it difficult. The Entity Framework, for instance, presents developers with cognitive dissonance at several points. The documentation will tell you the goal of EF is to create a rich, conceptual object model, but the press releases proclaim that the Entity Framework simplifies data-centric development.  There will not be any plain old CLR objects (POCOs) in EF, and the object-focused implicit lazy-loading that comes standard in most ORMs isn’t available (you can read any property on this entity, um, except that one – you’ll have to load it first).

LINQ to SQL is different. LINQ to SQL is objects all the way down. You can use plain old CLR objects with LINQ to SQL if you dig beyond the surface. However, the surface is a shiny designer that looks just like the typed DataSet designer. LINQ to SQL also needs some additional mapping flexibility to truly separate the object  model from the underlying database schema – hopefully we’ll see this in the next version.

What To Do?

If you are a .NET developer who is starting to use an ORM (any ORM), you owe it to yourself and your project to reset your defaults and think differently about the new paradigm. Forget what you know about DataSets and learn about the unit of work pattern. Forget what you know about data readers and learn how an ORM identity map works. Think objects first, data second. If you can’t think of data second, an ORM might not be the technology for you.

posted by scott with 11 Comments

LINQ Deep Dive at D.C. ALT.NET Next Week

Matt Podwysocki invited me to speak at the D.C. alt.net meeting next Thursday evening (July 24th). The topic is LINQ. Matt specifically requested a code-heavy presentation, so expect two slides followed by plenty of hot lambda and Expression<T> action.

Hopefully, Matt doesn’t black out the neighborhood like he did at the nearby RockNUG meeting this week. The White House is two blocks away and the people inside get a little jumpy about blackouts.

 

DateTime:
7/24/2008 - 7PM-9PM

Location:
Cynergy Systems Inc.
1600 K St NW
Suite 300
Washington, DC 20006

posted by scott with 7 Comments

Keeping LINQ Code Healthy

In the BI space I’ve seen a lot of SQL queries succumb to complexity. A data extraction query adds some joins, then some filters, then some nested SELECT statements, and it becomes an unhealthy mess in short order. It’s unfortunate, but standard SQL just isn’t a language geared for refactoring towards simplification (although UDFs and CTEs in T-SQL have helped).

I’ve really enjoyed writing LINQ queries this year, and I’ve found them easy to keep pretty.

For example, suppose you need to parse some values out of the following XML:

<ROOT>
  <data>
    <record>
      <field name="Country">Afghanistan</field>
      <field name="Year">1993</field>
      <field name="Value">16870000</field>
      <!-- ... -->
    </record>
    <!-- ... -->
  </data>
</ROOT>

A first crack might look like the following:

var entries =
    from r in doc.Descendants("record")
    select new
    {
        Country = r.Elements("field")
                   .Where(f => f.Attribute("name").Value == "Country")
                   .First().Value,
        Year = r.Elements("field")
                .Where(f => f.Attribute("name").Value == "Year")
                .First().Value,
        Value = double.Parse(
                    r.Elements("field")
                     .Where(f => f.Attribute("name").Value == "Value")
                     .First().Value)
    };

The above is just a mass of method calls and string literals. But, add in a quick helper or extension method…

public static XElement Field(this XElement element, string name)
{
    return element.Elements("field")
                  .Where(f => f.Attribute("name").Value == name)
                  .First();
}

… and you can quickly turn the query around into something readable.

var entries =
    from r in doc.Descendants("record")
    select new
    {
        Country = r.Field("Country").Value,
        Year = r.Field("Year").Value,
        Value = double.Parse(r.Field("Value").Value)
    };

If only SQL code were just as easy to break apart!
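To try the refactored query end to end, here is a minimal self-contained sketch. The sample document is inlined, the class names XmlFieldExtensions and FieldDemo are made up for illustration, and the number is parsed and formatted with CultureInfo.InvariantCulture (an addition that is not in the snippets above) so the result does not depend on the machine's locale.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class XmlFieldExtensions
{
    // Same helper as in the post: the first <field> child whose name
    // attribute matches. Throws if no such field exists.
    public static XElement Field(this XElement element, string name)
    {
        return element.Elements("field")
                      .Where(f => f.Attribute("name").Value == name)
                      .First();
    }
}

public static class FieldDemo
{
    // Sample data shaped like the XML shown earlier in the post.
    private const string Xml = @"<ROOT>
  <data>
    <record>
      <field name=""Country"">Afghanistan</field>
      <field name=""Year"">1993</field>
      <field name=""Value"">16870000</field>
    </record>
  </data>
</ROOT>";

    // Returns one "Country/Year/Value" string per <record> element.
    public static string[] Run()
    {
        XDocument doc = XDocument.Parse(Xml);

        var entries =
            from r in doc.Descendants("record")
            select new
            {
                Country = r.Field("Country").Value,
                Year = r.Field("Year").Value,
                Value = double.Parse(r.Field("Value").Value,
                                     CultureInfo.InvariantCulture)
            };

        return entries
            .Select(e => string.Format(CultureInfo.InvariantCulture,
                                       "{0}/{1}/{2:F0}",
                                       e.Country, e.Year, e.Value))
            .ToArray();
    }
}
```

Because Field returns an XElement rather than a string, the caller still decides how to convert the value, which keeps the helper usable for text and numeric fields alike.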

posted by scott with 2 Comments

Restku

Haiku is a popular poetic form that has evolved over centuries. Restku is Haiku with a  twist.

crystal pixels
get brighter
an abundance of excitement

The twist is that the author of a Restku is restricted to using a single verb from this list: get, post, put, and delete. Although traditional Restku insists on present tense usage of the four verbs, adventurous  authors will mix in past tense, future tense, and on occasion, present perfect tense.

unexpected dialog
a “progress” bar
vista has posted the bad news

Although Restku was inspired by REST, a software architecture style,  there is no reason an author can’t frame concepts from outside the world of information technology into a Restku.

weathered glove
humid skies
put on a childhood dream

Relax your mind with the mental stimulation of writing a Restku today, for tomorrow is still a mystery.

four hundred and four
electrical neurons
delete her memory

posted by scott with 2 Comments

Herding Code

Herding Code is a podcast about a variety of topics in technology and software development. It’s done roundtable style with Scott Koon, Kevin Dente, Jon Galloway, and me. The conversations are a blast, and I hope informative, too.

Tune in to the feed here: http://feeds.feedburner.com/HerdingCode

posted by scott with 2 Comments

Swimming Upstream Is Hazardous

Salmon swim upstream, and look at what happens …

    

[Image: salmon]

Every developer is familiar with the “work around”. These are the extra bits of code we write to overcome limitations in an API, platform, or framework.

But, sometimes those limitations are a feature. The designer of a framework might be guiding you in a specific direction. Take the Silverlight networking APIs as an example. The APIs provide only asynchronous communication options, yet I’ve seen a few people try to block on network operations with code like the following:

AutoResetEvent _event = new AutoResetEvent(false);
WebClient client = new WebClient();
client.DownloadStringCompleted +=
    (s, ev) => { _message.Text = ev.Result; _event.Set(); };
client.DownloadStringAsync(new Uri("foo.xml", UriKind.Relative));
_event.WaitOne();

This code results in a deadlock, since the WebClient tries to raise the completed event on the main thread, but the main thread is blocked inside WaitOne and waiting for the completed event to fire. This deadlock is not only fatal to the Silverlight application, but can bring down the web browser, too. Even if this code didn't create a deadlock, do you really want your application to block over a slow network connection?

When you find yourself writing “work around” code, it’s worthwhile to review the situation. Are you really working around a limitation? Or are you working against the intended use of a framework? Working against the framework is rarely a good idea – there can be a lot of hungry bears waiting to catch you in the future.

posted by scott with 3 Comments

Pluralsight 2.0

Pluralsight has a new website, and the new site includes some online training options! See Fritz’s post for more details. Be sure to check out one of the newest classes - the LINQ Fundamentals course, too.

posted by scott with 2 Comments

Rob's Not So Lazy MVC Storefront

Rob ran into some lazy load problems in his MVC Storefront and later proclaimed:

"…if you set any Enumerable anything as a property, it's Count property will be accessed when you load the parent object. This negates using any deferred loading for any POCOs, period"

Rob thought this was a problem with .NET in general, but I was suspicious. Veeery suspicious. I downloaded Rob's latest bits and found some interesting behavior.

Based on the screen shot of the call stack that Rob posted, it appeared LINQ to SQL was doing some type conversions. If you poke around the classes mentioned in the call stack, you'll eventually wander into a GenerateConvertToType method that uses LCG to build dynamic methods. Just based on the opening conditional logic, I thought Rob might solve his problem by using LazyList<T> for his business object properties, too (whether or not he'd want to is a different question), so I modified his Category class for a few experiments to see what would really lazy load.

public class Category {

    // rob's original
    public IList<Product> Products { get; set; }

    // experimental
    public LazyList<Product> ProductsLazy { get; set; }
    public IQueryable<Product> ProductsQueryable { get; set; }
    public IEnumerable<Product> ProductsEnumerable { get; set; }

    // ...

This was in hopes that LINQ to SQL wouldn't feel compelled to do a conversion via List<T>. I just needed to tweak the query to set all four properties.

var result = from c in db.Categories
             join cn in culturedName on c.CategoryID equals cn.CategoryID
             let products = from p in GetProducts()
                            join cp in db.Categories_Products
                                on p.ID equals cp.ProductID
                            where cp.CategoryID == c.CategoryID
                            select p
             select new Category
             {
                 ID = c.CategoryID,
                 Name = cn.CategoryName,
                 ParentID = c.ParentID ?? 0,
                 Products = new LazyList<Product>(products),
                 ProductsQueryable = products,
                 ProductsEnumerable = products.AsEnumerable(),
                 ProductsLazy = new LazyList<Product>(products)
             };
return result;

This experiment failed in a stunning fashion, because none of the Product properties lazy loaded – they all eagerly populated themselves full of real product objects. Hmmm.

Slight Detour

Watching SQL Profiler, I started to wonder why there were soooo many queries running. Sure, the stuff wasn't lazy loading but the queries were flying by quicker than eggs at a Steve Ballmer talk. Yet, the code that was kicking off the whole process was just looking for a single category:

Category result = _repository.GetCategories()
                             .WithCategoryID(id)
                             .SingleOrDefault();

That problem turned out to be in Rob's WithCategoryID extension method.

public static IEnumerable<Category> WithCategoryID(
    this IEnumerable<Category> qry, int ID) {

    return from c in qry
           where c.ID == ID
           select c;
}

By taking an IEnumerable<T> parameter, the extension method was forcing the query to execute and then doing all the ID checks using LINQ to Objects. Just switching over to IQueryable<T> made the method a lot more efficient, and the number of queries came down tremendously.
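A minimal sketch of that change follows. Category here is a stripped-down stand-in for Rob's class, and the in-memory AsQueryable() source is an assumption that replaces the LINQ to SQL table so the example runs without a database. What matters is which Where method the compiler binds to: with an IQueryable<T> parameter the filter is recorded as an expression tree that a provider such as LINQ to SQL can translate into a WHERE clause, while an IEnumerable<T> parameter binds to LINQ to Objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Stripped-down stand-in for the storefront's Category class.
public class Category
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public static class CategoryFilters
{
    // IQueryable<T> in, IQueryable<T> out: the where clause binds to
    // Queryable.Where and is captured as an expression tree.
    public static IQueryable<Category> WithCategoryID(
        this IQueryable<Category> qry, int ID)
    {
        return from c in qry
               where c.ID == ID
               select c;
    }

    // The original shape, renamed so both versions can sit side by side:
    // the where clause binds to Enumerable.Where (LINQ to Objects).
    public static IEnumerable<Category> WithCategoryIDInMemory(
        this IEnumerable<Category> qry, int ID)
    {
        return from c in qry
               where c.ID == ID
               select c;
    }
}

public static class FilterDemo
{
    // In-memory stand-in for db.Categories.
    public static IQueryable<Category> Source()
    {
        return new List<Category>
        {
            new Category { ID = 1, Name = "A" },
            new Category { ID = 2, Name = "B" }
        }.AsQueryable();
    }

    // Name of the class that declares the Where method the query bound to.
    public static string BoundTo()
    {
        var call = (MethodCallExpression)Source().WithCategoryID(2).Expression;
        return call.Method.DeclaringType.Name;
    }
}
```

Both methods return the same category for an in-memory source. The difference only shows up against a real provider, where the IEnumerable<T> version enumerates the source and filters in memory instead of handing the filter to the database.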

Correlating Problems

Back to the original problem, which was a bit of a mystery because I've been able to lazy load collections using IEnumerable<T> and IQueryable<T>. After some more fiddling, I began to suspect the query itself. The query uses a correlated subquery by virtue of the fact that the range variable c is used inside the query for products (c.CategoryID). I'm guessing that LINQ to SQL felt compelled to take care of all the work in one fell swoop. Instead of using a subquery, I presented LINQ to SQL with a method call that pushed the needed parameter (c.CategoryID) onto the stack, and made things slightly more readable in the process.

    var result = from c in db.Categories
                 join cn in culturedName
                     on c.CategoryID equals cn.CategoryID
                 let products = GetProducts(c.CategoryID)
                 select new Category
                 {
                     ID = c.CategoryID,
                     Name = cn.CategoryName,
                     ParentID = c.ParentID ?? 0,
                     Products = new LazyList<Product>(products),
                     ProductsQueryable = products,
                     ProductsEnumerable = products.AsEnumerable(),
                     ProductsLazy = new LazyList<Product>(products)
                 };
    return result;
}

public IQueryable<Product> GetProducts(int categoryID)
{
    var products = from p in GetProducts()
                   join cp in db.Categories_Products on p.ID equals cp.ProductID
                   where cp.CategoryID == categoryID
                   select p;
    return products;
}

And voila! Three of the properties (ProductsQueryable, ProductsEnumerable, ProductsLazy) would lazy load their Products from the database. Only the original IList<Product> property would eagerly fetch data. From what I can decipher in the grungy code, when LINQ to SQL sees it needs to assign to an IList<T>, and it doesn't have an IList<T>, it eagerly loads a new List<T> and copies those elements into the destination. At least, that's my theory.

Knowing what I know now, I could tell Rob to stick with IList<T> as his property type, but to make sure he has IList<T> on both sides of the assignment in his projection (and tuck the product query into a method call). In other words, use the following to create the LazyList<T> - LINQ to SQL won't load up Products during some weird type conversion:

public class LazyList<T> : IList<T> {

    public static IList<T> Create(IQueryable<T> query)
    {
        return new LazyList<T>(query);
    }

    // ...

Conclusion? Beware of mismatched types, particularly with IList<T>, and watch out for eager execution with correlated subqueries.
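The post only shows the Create factory, so here is a rough sketch of what a LazyList<T> could look like. It is not Rob's actual implementation. It assumes the list simply wraps an IQueryable<T> and materializes it with ToList() the first time any member is touched; the IsLoaded property is a hypothetical addition so the deferral can be observed.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: wraps a query and runs it only on first access.
public class LazyList<T> : IList<T>
{
    private readonly IQueryable<T> _query;
    private List<T> _inner;

    public LazyList(IQueryable<T> query) { _query = query; }

    // Returning IList<T> keeps IList<T> on both sides of the assignment.
    public static IList<T> Create(IQueryable<T> query)
    {
        return new LazyList<T>(query);
    }

    // The wrapped query executes here, once, on first use.
    private List<T> Inner
    {
        get { return _inner ?? (_inner = _query.ToList()); }
    }

    // Not part of IList<T>; added only to make the laziness visible.
    public bool IsLoaded { get { return _inner != null; } }

    // Every IList<T> member simply forwards to the materialized list.
    public T this[int index]
    {
        get { return Inner[index]; }
        set { Inner[index] = value; }
    }
    public int Count { get { return Inner.Count; } }
    public bool IsReadOnly { get { return false; } }
    public void Add(T item) { Inner.Add(item); }
    public void Clear() { Inner.Clear(); }
    public bool Contains(T item) { return Inner.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { Inner.CopyTo(array, arrayIndex); }
    public int IndexOf(T item) { return Inner.IndexOf(item); }
    public void Insert(int index, T item) { Inner.Insert(index, item); }
    public bool Remove(T item) { return Inner.Remove(item); }
    public void RemoveAt(int index) { Inner.RemoveAt(index); }
    public IEnumerator<T> GetEnumerator() { return Inner.GetEnumerator(); }
    IEnumerator IEnumerable.GetEnumerator() { return Inner.GetEnumerator(); }
}
```

Constructing the list does no work. Reading Count, indexing, or enumerating is what triggers the query, which is exactly the moment you want the SQL to show up in Profiler.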

posted by scott with 3 Comments

Visual Designers Don’t Scale

Microsoft has a long history of being visual. They've made quite a bit of money implementing graphical user interfaces everywhere – from operating system products to database servers, and of course, developer products. What would Visual Studio be if it wasn't visual?

And oh how visual it is! Visual Studio includes a potpourri of visualization tools. There are class diagrams, form designers, data designers, server explorers, schema designers, and more. I want to classify all these visual tools into one of two categories. The first category includes all the visual tools that build user interfaces – the WinForms and WebForms designers, for instance. The second category includes everything else.

Visual tools that fall into the first category, the UI builders, are special because they never need to scale. Nobody is building a Windows app for 5,000 x 5,000 pixel screens. Nobody is building web forms with 5,000 textbox controls. At least I hope not. You can get a pretty good sense of when you are going to overwhelm a user just by looking at the designer screen.

Visual tools that fall into the second category have to cover a wide range of scenarios, and they need to scale. I stumbled across an 8-year-old technical report today entitled "Visual Scalability". The report defines visual scalability as the "capability of visualization tools to display large data sets". Although this report has demographics data in mind, you can also think of large data sets as databases with a large number of tables, or libraries with a large number of classes - these are the datasets that Visual Studio works with, and as the datasets grow, the tools fall down.

Here is an excerpt of a screenshot for an Analysis Services project I had to work with recently:

Here is an excerpt of an Entity Data model screenshot I fiddled with for a medical database:

These are just two samples where the visual tools don't scale and inflict pain. They are difficult to navigate, and impossible to search. The layout algorithms don't function well on these large datasets, and the number of mouse clicks required to make simple changes is astronomical. The best you can do is jump into the gnarly XML that hides behind the visual representation.

I'm wondering if the future will see a reversal in the number of visual tools trying to enter our development workflow. Perhaps textual representations, like DSLs in IronRuby, will be the trick.

posted by scott with 23 Comments

The Power of Programming With Attributes

Nothing can compare to the Real Power of programming with attributes. Why, just one pair of square brackets and woosh – my object can be serialized to XML. Woosh – my object can persist to a database table. Woosh – there goes my object over the wire in a digitally signed SOAP payload. One day I expect to see a new item template in Visual Studio – the "Add New All Powerful Attributed Class" template: *

[Table]
[DataObject]
[DataContract]
[Serializable]
[TwoKitchenSinks]
[CLSCompliant(true)]
[DefaultProperty("Name")]
[DefaultBindingProperty("Name")]
[DebuggerStepThroughAttribute]
[GuidAttribute("F0DD2CAA-2132-11DD-AC50-FE9355D89593")]
public class Person
{
    [Column]
    [DataMember]
    [XmlAttribute]
    [Browsable(true)]
    [ReadOnly(false)]
    [Category("Advanced")]
    [Description("The person's name")]
    public string Name { get; set; }

    // TODO: YOUR INSIGNIFICANT BIZ LOGIC GOES HERE...
}

Which raises the question – could there ever be a way to separate attributes from the class definition?**

* Put down the flamethrower and step away - I'm kidding.

**This part was a serious question.

posted by scott with 14 Comments