RobVious
RobVious

Reputation: 12915

Given a list of length n select k random elements using C#

I found this post:

Efficiently selecting a set of random elements from a linked list

But this means that in order to approach true randomness in the sample I have to iterate over all elements, throw them in memory with a random number, and then sort. I have a very large set of items here (millions) - is there a more efficient approach to this problem?

Upvotes: 3

Views: 2595

Answers (3)

Jon Skeet
Jon Skeet

Reputation: 1500903

I would suggest simply shuffling elements as if you were writing a modified Fisher-Yates shuffle, but only bother shuffling the first k elements. For example:

public static void PartialShuffle<T>(IList<T> source, int count, Random random)
{
    for (int i = 0; i < count; i++)
    {
        // Pick a random element out of the remaining elements,
        // and swap it into place.
        int index = i + random.Next(source.Count - i);
        T tmp = source[index];
        source[index] = source[i];
        source[i] = tmp;
    }
}

After calling this method, the first count elements will be randomly picked elements from the original list.

Note that I've specified the Random as a parameter, so that you can use the same one repeatedly. Be careful about threading though - see my article on randomness for more information.

Upvotes: 11

Olivier Jacot-Descombes
Olivier Jacot-Descombes

Reputation: 112402

If the elements can be in memory, put them in memory first

List<Element> elements = dbContext.Select<Element>();

Now you know the number of elements. Create a set of unique indexes.

var random = new Random();
var indexes = new HashSet<int>();
while (indexes.Count < k) {
    indexes.Add(random.Next(elements.Count));
}

Now you can read the elements from the list

var randomElements = indexes.Select(i => elements[i]);

I assume that the DB contains unique elements. If this is not the case, you will have to create a HashSet<Elements> instead or to append .Distinct() when querying from the DB.


UPDATE

As Patricia Shanahan says, this method will work well if k is small compared to n. If it is not the case, I suggest selecting a set n - k indexes to be excluded

var random = new Random();
var indexes = new HashSet<int>();
IEnumerable<Element> randomElements;

if (k <= elements.Count / 2) {
    while (indexes.Count < k) {
        indexes.Add(random.Next(elements.Count));
    }
    randomElements = indexes.Select(i => elements[i]);
} else {
    while (indexes.Count < elements.Count - k) {
        indexes.Add(random.Next(elements.Count));
    }
    randomElements = elements
        .Select((e,i) => indexes.Contains(i) ? null : elements[i])
        .Where(e => e != null);
}

Upvotes: 3

Code Uniquely
Code Uniquely

Reputation: 6373

Take a look at this extension method http://extensionmethod.net/csharp/ienumerable-t/shuffle. You could add Skip() Take() type to page the values out the final list.

Upvotes: 3

Related Questions