Reputation: 23
I'm an F# newbie. The following code reads a CSV file, keeps every line whose rating is > 9.0, and writes those lines to a new file.
But I'm having a hard time figuring out how many for loops it takes to get the job done: are all the steps done in one for loop, or in the 5 for loops I've marked below?
If it needs 5 for loops, it should have loaded the whole file into memory by the time the 2nd loop completes, but the process only consumes 14 MB of memory the whole time, which is less than the size of the CSV file.
I know StreamReader will not load the whole file into memory at once, but how does the following code actually execute?
Thanks in advance.
open System.IO

let ratings = @"D:\Download\IMDB\csv\title.ratings.csv"
let rating9 = @"D:\Download\IMDB\csv\rating9.csv"

let readCsv reader =
    Seq.unfold (fun (r:StreamReader) ->             // 1st for loop
        match r.EndOfStream with
        | true -> None
        | false -> Some (r.ReadLine(), r)) reader

let toTuple = fun (s:string) ->
    let ary = s.Split(',')
    (string ary.[0], float ary.[1], int ary.[2])

using (new StreamReader(ratings)) (fun sr ->
    use sw = new StreamWriter(rating9)
    readCsv sr
    |> Seq.map toTuple                              // 2nd for loop
    |> Seq.filter (fun (_, r, _) -> r > 9.0)        // 3rd for loop
    |> Seq.sortBy (fun (_, r, _) -> r)              // 4th for loop
    |> Seq.iter (fun (t, r, s) ->                   // 5th for loop
        sw.WriteLine(sprintf "%s,%.1f,%i" t r s)))
Upvotes: 2
Views: 442
Reputation: 36688
The missing piece in your understanding is that F#'s Seq is lazy. It will not do any more work than needed, and in particular, it will not consume the sequence until it's absolutely necessary to do so. Seq.map and Seq.filter do not act like for loops; instead, they act like a transformation pipeline that stacks a new transformation on top of the existing ones. The first piece of your code that will actually run through the entire sequence is Seq.sortBy (because sorting a sequence requires knowing what all its values are, so Seq.sortBy must consume the entire sequence to do its job). And by that point, the Seq.filter step has already happened, so many lines of the CSV file have been thrown out, and that's why the program consumes less memory than the total size of the original file.
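As a rough sketch of that memory point: the snippet below counts how many values enter the pipeline and how many survive the filter and therefore reach Seq.sortBy. The synthetic ratings and the read and kept counters are stand-ins I made up for illustration; they are not part of your code.

```fsharp
// Counters are ref cells so the lambdas below can update them.
let read = ref 0   // values pulled from the source
let kept = ref 0   // values that pass the filter and reach Seq.sortBy

seq { 1 .. 1000 }
|> Seq.map (fun i ->
    read.Value <- read.Value + 1
    float i / 100.0)                       // synthetic ratings 0.01 .. 10.00
|> Seq.filter (fun r -> r > 9.0)
|> Seq.map (fun r ->
    kept.Value <- kept.Value + 1           // tap: count what survives the filter
    r)
|> Seq.sortBy id                           // has to buffer only the kept values
|> Seq.iter ignore

printfn "read = %d, kept = %d" read.Value kept.Value
// prints "read = 1000, kept = 100"
```

Only the 100 surviving values are ever handed to the sort, which is the same reason your process stays well below the size of the CSV file.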
Here's a practical demonstration of Seq being lazy, typed into the F# Interactive prompt. Watch this:
> let s = seq {1..20} ;;
val s : seq<int>
> let t = s |> Seq.map (fun i -> printfn "Starting with %d" i; i) ;;
val t : seq<int>
> let u = t |> Seq.map (fun i -> i*2) ;;
val u : seq<int>
> let v = u |> Seq.map (fun i -> i - 1) ;;
val v : seq<int>
> let w = v |> Seq.filter (fun i -> i > 10) ;;
val w : seq<int>
> let x = w |> Seq.sortBy id ;;
val x : seq<int>
> let y = x |> Seq.iter (fun i -> printfn "Result: %d" i) ;;
Starting with 1
Starting with 2
Starting with 3
Starting with 4
Starting with 5
Starting with 6
Starting with 7
Starting with 8
Starting with 9
Starting with 10
Starting with 11
Starting with 12
Starting with 13
Starting with 14
Starting with 15
Starting with 16
Starting with 17
Starting with 18
Starting with 19
Starting with 20
Result: 11
Result: 13
Result: 15
Result: 17
Result: 19
Result: 21
Result: 23
Result: 25
Result: 27
Result: 29
Result: 31
Result: 33
Result: 35
Result: 37
Result: 39
val y : unit = ()
> let z = w |> Seq.iter (fun i -> printfn "Result: %d" i) ;;
Starting with 1
Starting with 2
Starting with 3
Starting with 4
Starting with 5
Starting with 6
Result: 11
Starting with 7
Result: 13
Starting with 8
Result: 15
Starting with 9
Result: 17
Starting with 10
Result: 19
Starting with 11
Result: 21
Starting with 12
Result: 23
Starting with 13
Result: 25
Starting with 14
Result: 27
Starting with 15
Result: 29
Starting with 16
Result: 31
Starting with 17
Result: 33
Starting with 18
Result: 35
Starting with 19
Result: 37
Starting with 20
Result: 39
val z : unit = ()
Notice how even though Seq.sortBy needs to consume the whole sequence in order to do its job, since no part of the Seq was being requested when I created sequence x, it didn't actually start running through the values. Only the y and z bindings, which used Seq.iter, actually triggered running through all the values. (But you can see how, with y, the sortBy step had to run in its entirety before the iter step could run, whereas with z, where there was no sortBy step, each value passed all the way through the transformation pipeline one at a time, and only once each value was completely handled did the next value start getting processed.)
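To tie this back to the pipeline in the question, here is a minimal sketch that runs the same map / filter / sortBy / iter chain over an in-memory StringReader and records the order of events. The sample rows, the log list, and the readLines and run names are assumptions made up for this illustration; it also checks ReadLine() for null rather than using EndOfStream, since StringReader has no such property.

```fsharp
open System.IO

// Hypothetical stand-in for the CSV file: id, rating, votes.
let sample = "tt01,9.5,10\ntt02,5.0,20\ntt03,9.1,30\ntt04,7.2,40"

// Records the order in which lines are read and results are written.
let log = ResizeArray<string>()

let readLines (reader: StringReader) =
    Seq.unfold (fun (r: StringReader) ->
        match r.ReadLine() with
        | null -> None                          // end of input
        | line ->
            log.Add(sprintf "read %s" line)
            Some (line, r)) reader

let toTuple (s: string) =
    let ary = s.Split(',')
    (ary.[0], float ary.[1], int ary.[2])

let run () =
    use sr = new StringReader(sample)
    readLines sr
    |> Seq.map toTuple
    |> Seq.filter (fun (_, r, _) -> r > 9.0)
    |> Seq.sortBy (fun (_, r, _) -> r)
    |> Seq.iter (fun (t, r, _) -> log.Add(sprintf "write %s %.1f" t r))

run ()
log |> Seq.iter (printfn "%s")
// read tt01,9.5,10
// read tt02,5.0,20
// read tt03,9.1,30
// read tt04,7.2,40
// write tt03 9.1
// write tt01 9.5
```

All four reads happen before the first write because of the sortBy step, but only the two rows that passed the filter were held in memory for sorting. If you remove the Seq.sortBy line, the reads and writes interleave, just like sequence w in the transcript above.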
Upvotes: 6