nik

Reputation: 1784

Working with large text files?

I need to import a large text file (55 MB, 525000 * 25), manipulate the data, and produce some output. As usual I started exploring with F# Interactive, and I get some really strange behaviour.

Is the file too large, or is my code wrong?

The first test was to import the file and simply compute the sum over one column (not the end goal, just a first test):

    let calctest =
        let reader = new StreamReader(path)
        let csv = reader.ReadToEnd()
        csv.Split([|'\n'|])
        |> Seq.skip 1
        |> Seq.map (fun line -> line.Split([|','|]))
        |> Seq.filter (fun a -> a.[11] = "M")
        |> Seq.map (fun values -> float(values.[14]))

As expected, this produces a seq of float both in the type check and in interactive. If I now add:

        |> Seq.sum

Type check works and says this function should return a float but if I run it in interactive I get this error:

System.IndexOutOfRangeException: Index was outside the bounds of the array

Then I removed the last line again and thought I would look at the seq of float in a text file:

    let writetest =
        let str = calctest |> Seq.map (fun i -> i.ToString())
        System.IO.File.WriteAllLines("test.txt", str)

Again, this passes the type check but throws errors in interactive.

Can the standard StreamReader not handle that amount of data? Or am I going wrong somewhere? Should I use a different function than StreamReader? Thanks.

Upvotes: 2

Views: 719

Answers (1)

Gustavo Guerra

Reputation: 5359

Seq is lazy, which means the mapping and filtering only actually run once you add Seq.sum. That's why you don't see the error before adding that line. Are you sure every row has at least 15 columns? That's probably the problem.
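
To make the laziness concrete, here is a minimal sketch using a small in-memory string in place of the real file. The data and column indices are made up for illustration. One plausible source of a short row (an assumption, since the actual file isn't shown) is a trailing newline: splitting on '\n' then leaves an empty last line, which splits into a one-element array.

```fsharp
// Hypothetical stand-in for the file contents: a header, two complete rows,
// and a trailing '\n' that leaves an empty last element after Split.
let csv = "a,b,c\n1,M,2.5\n3,M,4.0\n"

// Building the pipeline runs nothing yet, so no exception is raised here.
let parsed =
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[1] = "M")        // would fail on the short row
    |> Seq.map (fun values -> float values.[2])

// Only forcing the sequence (here with Seq.sum) triggers the indexing error.
let failedLazily =
    try
        parsed |> Seq.sum |> ignore
        false
    with :? System.IndexOutOfRangeException -> true

// Same pipeline with a length guard, so short rows are skipped instead.
let total =
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.Length > 2 && a.[1] = "M")
    |> Seq.map (fun values -> float values.[2])
    |> Seq.sum
```

In the question's code the equivalent guard would be something like a.Length > 14 before indexing columns 11 and 14.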

I would advise you to use the CSV Type Provider instead of just doing a String.Split. That way you'll be sure not to have an accidental IndexOutOfRangeException, and you'll handle comma escaping correctly.

Additionally, you're reading the whole CSV file into memory by calling reader.ReadToEnd(). The CsvProvider supports streaming if you set the CacheRows parameter to false. That's not a problem with a 55 MB file, but it might be if you have something much larger.
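
If you want streaming without the type provider, a rough alternative (not the CsvProvider approach described above) is System.IO.File.ReadLines, which yields lines lazily instead of loading the whole file. The helper name sumColumn and its parameters are hypothetical, and the plain Split still does not handle quoted commas.

```fsharp
open System.IO

// Hypothetical helper: sums column `valueCol` over the rows whose
// `flagCol` equals `flag`, reading the file one line at a time.
let sumColumn (path: string) (flagCol: int) (flag: string) (valueCol: int) =
    File.ReadLines(path)                         // lazy, line by line
    |> Seq.skip 1                                // skip the header row
    |> Seq.map (fun line -> line.Split([|','|]))
    // Skip rows too short to index safely, then apply the filter.
    |> Seq.filter (fun a -> a.Length > max flagCol valueCol && a.[flagCol] = flag)
    |> Seq.sumBy (fun a -> float a.[valueCol])
```

With the question's column numbers this would be called as sumColumn path 11 "M" 14.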

Upvotes: 4
