Sequence of incorrect length generated by function

Question

Why is the following function returning a sequence of incorrect length when the repl variable is set to false?

open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =

    let n = data |> Seq.length

    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        match m > 0 with
        | true ->
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m 
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        | false -> 
            idx

    let sample =
        match repl with
        | true ->
            DiscreteUniform.Samples(0, n-1) 
            |> Seq.take size 
            |> Seq.map (fun index -> Seq.item index data)
        | false ->
            generateIndex (seq []) 
            |> Seq.map (fun index -> Seq.item index data)

    sample

Running the function...

let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length 
printfn "requested -> %A\nreturned -> %A" requested resultlen

Resulting lengths are wrong.

> 
requested -> 1000
returned -> 998

> 
requested -> 1000
returned -> 1001

> 
requested -> 1000
returned -> 997

Any idea what mistake I'm making?

rmunn · Accepted Answer

First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.

In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:

let sample (various_parameters) =  // This is a function
    // Other code ...
    let sample = some_calculation  // This is a variable
    sample  // Return the variable

While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:

let f (a : float option) =
    let a = match a with
            | None -> 0.0
            | Some value -> value
    // Now proceed, knowing that `a` has a real value even if it had been None before

Or, because the above is exactly what F# gives you defaultArg for:

let f (a : float option) =
    let a = defaultArg a 0.0
    // This does exactly the same thing as the previous snippet

Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.

Also, this seems unnecessarily verbose:

let result =
    match repl with
    | true ->
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.take size 
        |> Seq.map (fun index -> Seq.item index data)
    | false ->
        generateIndex (seq []) 
        |> Seq.map (fun index -> Seq.item index data)

result

Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:

match repl with
| true ->
    DiscreteUniform.Samples(0, n-1) 
    |> Seq.take size 
    |> Seq.map (fun index -> Seq.item index data)
| false ->
    generateIndex (seq []) 
    |> Seq.map (fun index -> Seq.item index data)

Which in turn can be replaced with an if...then...else expression:

if repl then
    DiscreteUniform.Samples(0, n-1) 
    |> Seq.take size 
    |> Seq.map (fun index -> Seq.item index data)
else
    generateIndex (seq []) 
    |> Seq.map (fun index -> Seq.item index data)

And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):

open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =

    let n = data |> Seq.length

    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        if m > 0 then
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m 
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        else
            idx

    if repl then
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.take size 
        |> Seq.map (fun index -> Seq.item index data)
    else
        generateIndex (seq []) 
        |> Seq.map (fun index -> Seq.item index data)

If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.

UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.

I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:

let rec generateIndex n size idx =
    let m = size - Seq.length(idx)
    printfn "m = %d" m
    match m > 0 with
    | true ->
        let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
        printfn "Generating newIdx as %A" (List.ofSeq newIdx)
        let idx = (Seq.append idx newIdx) |> Seq.distinct
        printfn "Now idx is %A" (List.ofSeq idx)
        generateIndex n size idx
    | false -> 
        printfn "Done, returning %A" (List.ofSeq idx)
        idx

All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:

m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]

See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].

Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one is exhausted, take all items from the second seq. And when that one is exhausted, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIndex is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 99) and then reduce that set down to a distinct set of numbers.
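The re-evaluation described above can be reproduced without MathNet. Here is a minimal sketch, assuming System.Random as a stand-in for DiscreteUniform.Samples; the helper name randomInts is hypothetical and only exists for this illustration:

```fsharp
open System

// Hypothetical stand-in for DiscreteUniform.Samples(0, n-1):
// an infinite lazy stream of integers in [0, n-1].
let rng = Random()
let randomInts n = Seq.initInfinite (fun _ -> rng.Next(0, n))

// Appending and de-duplicating only builds a bigger recipe; nothing is drawn yet.
let idx =
    Seq.append Seq.empty (randomInts 1000000 |> Seq.take 5)
    |> Seq.distinct

// Each enumeration re-runs the recipe, so these two lists almost certainly differ.
let first = List.ofSeq idx
let second = List.ofSeq idx
printfn "uncached: %A vs %A" first second

// Seq.cache stores values as they are produced, so later reads see the same numbers.
let cached = Seq.cache idx
let third = List.ofSeq cached
let fourth = List.ofSeq cached
printfn "cached:   %A vs %A" third fourth
```

Converting to a list or array once (List.ofSeq, Array.ofSeq) has the same stabilizing effect as Seq.cache, since the numbers are then stored rather than regenerated.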

Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.

To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:

m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]

Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
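The 0.9035 figure quoted earlier is just the product of the chances that each successive draw misses all the previous ones. A small sketch of that arithmetic (the function name probAllDistinct is made up for this example):

```fsharp
// Probability that `size` independent uniform draws from `n` values are all distinct:
// (n/n) * ((n-1)/n) * ... * ((n-size+1)/n)
let probAllDistinct (n : int) (size : int) =
    Seq.init size (fun i -> float (n - i) / float n)
    |> Seq.fold (*) 1.0

// 1.0 * 0.99 * 0.98 * 0.97 * 0.96
printfn "%.4f" (probAllDistinct 100 5)   // prints 0.9035
```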

By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace

        |> Seq.map (fun index -> Seq.item index data)

with

        |> Seq.map (fun index -> arrData.[index])

Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
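To see the cost difference, here is a rough sketch that counts how many elements a lazily generated sequence has to produce. The counter and the ten-element data sequence are invented for the illustration and are not part of the original code:

```fsharp
// Count how many elements are produced while the sequence is enumerated.
let produced = ref 0
let data =
    seq {
        for i in 0 .. 9 do
            produced.Value <- produced.Value + 1
            yield float i
    }

// Seq.item has to walk this sequence from the start to reach index 9.
let viaSeq = Seq.item 9 data
printfn "Seq.item 9 produced %d elements" produced.Value

// Array.ofSeq enumerates the sequence once; after that, indexing is constant time.
produced.Value <- 0
let arrData = Array.ofSeq data
let viaArray = [ arrData.[9]; arrData.[0]; arrData.[5] ]
printfn "three array lookups produced %d elements in total" produced.Value
```

With Seq.item inside Seq.map, that walk is repeated for every sampled index, which is where the quadratic behaviour comes from.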

TL;DR: Apply Seq.distinct to the random stream before you take values from it and the bug will go away. You can just replace your entire generateIndex function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster.) In other words, here's the almost-final version of how I would rewrite your code:

let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length

    if repl then
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.take size 
        |> Seq.map (fun index -> arrData.[index])
    else
        DiscreteUniform.Samples(0, n-1) 
        |> Seq.distinct
        |> Seq.take size 
        |> Seq.map (fun index -> arrData.[index])

That's it! Simple, easy to understand, and (as far as I can tell) bug-free.

Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:

let randomIndices =
    if repl then
        DiscreteUniform.Samples(0, n-1) 
    else
        DiscreteUniform.Samples(0, n-1) |> Seq.distinct

randomIndices
|> Seq.take size 
|> Seq.map (fun index -> arrData.[index])

So here's a truly-final version of my suggestion:

let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    let randomIndices =
        if repl then
            DiscreteUniform.Samples(0, n-1) 
        else
            DiscreteUniform.Samples(0, n-1) |> Seq.distinct
    randomIndices
    |> Seq.take size 
    |> Seq.map (fun index -> arrData.[index])
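
For anyone who wants to check the lengths without installing MathNet, here is a self-contained sketch of the same shape. It makes three assumptions that are not in the code above: System.Random (wrapped in a hypothetical uniformIndices helper) replaces DiscreteUniform.Samples, a guard rejects requests for more distinct items than the data holds, and the result is materialized into an array so that reading it twice gives the same values.

```fsharp
open System

// Hypothetical stand-in for DiscreteUniform.Samples(0, n-1); not the MathNet API.
let rng = Random(42)
let uniformIndices n = Seq.initInfinite (fun _ -> rng.Next(0, n))

let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    // Guard added for this sketch: without replacement, asking for more than n
    // items would make Seq.distinct |> Seq.take wait forever for new indices.
    if not repl && size > n then
        invalidArg "size" "cannot draw more distinct items than the data contains"
    let randomIndices =
        if repl then uniformIndices n
        else uniformIndices n |> Seq.distinct
    randomIndices
    |> Seq.take size
    |> Seq.map (fun index -> arrData.[index])
    |> Array.ofSeq   // materialize once so the result cannot change between reads

// Distinct data values make it easy to confirm that no index was reused.
let dat = Array.init 10000 float |> Seq.ofArray
let withoutRepl = sample dat 1000 false
let withRepl = sample dat 1000 true
printfn "without replacement -> %d" (Array.length withoutRepl)
printfn "with replacement    -> %d" (Array.length withRepl)
```

If you would rather keep the lazy seq as the return type, drop the final Array.ofSeq; the length will still be right, but each enumeration will then draw a fresh sample.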
