What’s Wrong With This Code (#23)

Sunday, October 18, 2009

Keith Dahlby started a discussion with his post “Is Functional Abstraction Too Clever?” Read Keith’s post for the background, but I would agree with all the comments I’ve read so far and I say the functional approach is a good solution to the problem.

Keith’s post did get me thinking of things that could go wrong in the hands of a developer who doesn’t understand how in-memory LINQ is working.

As an example, let’s use Keith’s Values extension method …

public static IEnumerable<int> Values(
    this Random random, 
    int minValue, int maxValue)
{
    while (true)
        yield return random.Next(minValue, maxValue);
}

… and let’s say a developer is using the above to satisfy the following requirements.

  1. Give me an array of 10 random numbers between 1 and 100
  2. The 10 numbers should be unique
  3. The 10 numbers should be appear in order from lowest value to highest value

What’s wrong with the following code?

var values =
    new Random().Values(1, 100)
                .Distinct()
                .OrderBy(value => value)
                .Take(10)
                .ToArray();

Do you think it’s a subtle problem?

What’s an easy fix?


Comments
gravatar Hirvox Sunday, October 18, 2009
Sorting an infinite enumeration is literally going to take forever. If Distinct internally uses a helper list to temporarily store previously seen values, switching the places of the Take and OrderBy clauses would work.
gravatar Jonathan Allen Sunday, October 18, 2009
Seems to me that infinite enumerations, while interesting, should never be allowed in production code. Just because you can do it doesn't mean you should.
Jeremy Gray Monday, October 19, 2009
As Jonathan said, I suspect the simplest solution is to perform the Take first. I do agree that the OrderBy behavior presents a subtle problem in this scenario, a subtle problem that some might miss while they get used to such a style of programming. That said, as with the discussions on the other thread, at least with code this brief and clear it becomes easy to spot a problem like the OrderBy call in your example. Such an issue might well still exist but be far more easily hidden in a more imperative solution. Sometimes, migrating small pieces of imperative code to this declarative style doesn't produce a large enough win, but you would not believe the number of bugs I have found in other people's code by doing just such a migration. After such behavior-preserving transformations, bugs literally jump out at you.
gravatar Hirvox Monday, October 19, 2009
Now that I've had my morning caffeine dose, it's time to elaborate:

This is a halting problem.

At some point, the internal list Distinct uses is going to contain every number between 1 and 100, and all of those were relayed to OrderBy. But then OrderBy requests the next integer in the enumeration. Try as it might, the Random routine is not going to generate an integer that satisfies the Distinct clause, causing an infinite loop.

And even my solution isn't flawless, because it takes a nondeterministic amount of time to evaluate. The random number generation enumeration could keep returning the same nondistinct numbers for an indefinite amount of time. It would probably even pass simple unit tests and cause random and hard to diagnose slowdowns in production. And if the piece of code is ever parametrized and the Take value is larger than the integer range, we would be right back where we started.

For small random integer ranges, it would be better to generate a list that contains all values in the range. Then you pick a number at a random spot in the list, remove it from the list and return it. At no point would more than ten random numbers be generated.
gravatar Rik Hemsley Monday, October 19, 2009
I'm guessing Distinct() never returns, because it doesn't know when to stop.

I don't think there's a simple fix, because no matter how many numbers you take, there's always the possibility that you don't have ten distinct.

Practically, you could keep taking numbers until you have ten distinct and this should complete within milliseconds, but theoretically it may never complete.
gravatar Rik Hemsley Monday, October 19, 2009
HIrvox's solution sounds good to me. Make a collection of possible numbers, take one at a random position until you have enough. You'd have to count down the range from which you pluck to match the decreasing size of the collection with each iteration.
gravatar dcarapic Monday, October 19, 2009
To be honest I do not see how you can make this work with pure Linq, there is no guarantee on when Random will return 10 unique numbers, could be first 10 are unique or you could get 10 uniques after taking 20 values from Random.
gravatar mat roberts Monday, October 19, 2009
Isn't there a subtle problem with requirements 1 and 2 being contradictory. You can't have 10 random numbers and insist that they are unique.

I guess you could ask for the first 10 numbers, from a random shuffle of numbers 1..100.
gravatar Jason Meckley Monday, October 19, 2009
The problem is ordering before taking. if you order first you will always get the lowest numbers. what you want is the random numbers ordered.

var values = new Random()
.Values(1, 100)
.Distinct()
.Take(10)
.OrderBy(value => value)
.ToArray();

should do the trick.
tobi Monday, October 19, 2009
there is no easy but correct fix. i would just do:

var values = new Random()
.Values(1, 100)
.Take(20) //take enough to make 10 distict numbers
.Distinct()
.Take(10)
.OrderBy(value => value)
.ToArray();

this is of course not entirely correct as few than 10 numbers could result. a correct solution would be to implement a new extension TakeDistinct that takes until N distinct numbers have arrived.
gravatar Jonathan Pryor Monday, October 19, 2009
LINQ expects, effectively, "pure" functions -- functions without side effects. The .Values() extension method is not pure. That's the problem.

The solution is to make things pure, possibly by using IEnumerable.Let():

community.bartdesmet.net/...

(Yes, that should make your brain hurt.)

The result:

var values = new Random()
.Values(1, 100)
.Let(src => src.Take(20)
.Distinct()
.OrderBy(value => value))
.Take(10)
.ToArray();
gravatar Scott Allen Monday, October 19, 2009
I was thinking the biggest problem was using OrderBy against an infinite sequence (thus the query runs forever), and switching the Take and OrderBy calls around would fix everything and make the world a safer place to live.

You guys have raised some other interesting points, though, and yes @Jonathon - anything pointing to Bart's blog makes my head hurt. :)
Will Smith Monday, October 19, 2009
Although Jason is right and moving the Take(10) ahead of the SortBy() "fixes" the infinite loop, I agree with Hirvox. Modify the Values extension method to always return 1 and you will see the Distinct().Take(10) never returns. You have no control over performance here.

One implementation of this "shuffling" problem is:

var randGen = new Random();
var source = Enumerable.Range( 1, 100 ).ToList();
var randList = Enumerable.Repeat( 1, 10 )
.Select( i =>
{
var index = randGen.Next( 101 - i );
var value = source[index];
source.RemoveAt( index );
return value;
} );

I'm trying to wrap my head around a "pure" linq solution, but I'm just not grasping it.

@tobi, There is no guarantee that Take(20), that is, 20 random numbers would yield at least 10 distinct values. It's possible (yet improbable) that all twenty values are in fact the same.
Will Smith Monday, October 19, 2009
or... how about this:

var values =
new Random().Shuffle( 1, 100 )
.Take( 10 )
.OrderBy( value => value )
.ToArray();

Where,

public static IEnumerable<int> Shuffle(
this Random random,
int minValue, int maxValue )
{
var source = Enumerable.Range( minValue, maxValue ).ToList();
while(source.Count > 1)
{
var index = random.Next( source.Count );
var value = source[index];
source.RemoveAt( index );
yield return value;
}

yield return source[0];
}

In fact:

var values =
new Random().Shuffle( 1, 100 )
.OrderBy( value => value )
.Take( 10 )
.ToArray();

would still be okay, because the Shuffle extension would never iterate more than the maxValue (100).
gravatar Mark Brackett Monday, October 19, 2009
Yeah, you can't really Distinct an infinite list. Nor can you Order it. My first thought was just calling .Take first, but that'd violate the Distinct requirement. So, you'd have to use TakeWhile with a helper list instead:

List<int> seenValues = new List<int>();
var values =
new Random().Values(1, 100)
.TakeWhile(i => {
if (!seenValues.Contains(i)) {
seenValues.Add(i);
}
return seenValues.Count < 10;
})
.Distinct()
.OrderBy(value => value)
.ToArray();

I'd probably combine the TakeWhile and Distinct into a TakeDistinct(int count) extension method just for cleanness, but that gives the idea.

Which, yeah - is subtle. But, debugging the hang would spot it immediately - and it'd fail unit testing immediately (well, at least it'd never return).
gravatar Alexander Abramov Friday, October 30, 2009
@Mark "Yeah, you can't really Distinct an infinite list."

Actually, you can. Distinct just filters list using knowledge about previously encountered elements. It does not try to enumerate infinite sequence itself. OrderBy is of course different story.

@Jonathan
Thanks for pointer, actually examples are very approachable.

@All
I tried to chew a little more on the topic:
blog.casualdev.net/...
Comments are now closed.
by K. Scott Allen K.Scott Allen
My Pluralsight Courses
The Podcast!