Reputation: 1784
I need to import a large text file (55MB) (525000 * 25) and manipulate the data and produce some output. As usual I started exploring with f# interactive, and I get some really strange behaviours.
Is this file too large or my code wrong?
First test was to import and simply comute the sum over one column (not the end goal but first test):
let calctest =
let reader = new StreamReader(path)
let csv = reader.ReadToEnd()
csv.Split([|'\n'|])
|> Seq.skip 1
|> Seq.map (fun line -> line.Split([|','|]))
|> Seq.filter (fun a -> a.[11] = "M")
|> Seq.map (fun values -> float(values.[14]))
As expected this produces a seq of float both in typecheck and in interactive. If I know add:
|> Seq.sum
Type check works and says this function should return a float but if I run it in interactive I get this error:
System.IndexOutOfRangeException: Index was outside the bounds of the array
Then I removed the last line again and thought I look at the seq of float in a text file:
let writetest =
let str = calctest |> Seq.map (fun i -> i.ToString())
System.IO.File.WriteAllLines("test.txt", str )
Again, this passes the type check but throws errors in interactive.
Can the standard StreamReader not handle that amount of data? or am I going wrong somewhere? Should I use a different function then Streamreader? Thanks.
Upvotes: 2
Views: 719
Reputation: 5359
Seq
is lazy, which means that only when you add the Seq.sum
is all the mapping and filtering actually being done, that's why you don't see the error before adding that line. Are you sure you have 15 columns on all rows? That's probably the problem
I would advise you to use the CSV Type Provider instead of just doing a string.Split
, that way you'll be sure to not have an accidental IndexOutOfRangeException, and you'll handle ,
escaping correctly.
Additionaly, you're reading the whole csv file into memory by calling reader.ReadToEnd()
, the CsvProvider supports streaming if you set the Cache
parameter to false. It's not a problem with a 55MB file, but if you have something much larger it might be
Upvotes: 4