Sibi

Reputation: 48644

Dealing with huge data

Let's assume that I have a big file (500GB+) and I have a data record declaration Sample which indicates a row in that file:

data Sample = Sample
  { field1 :: Int
  , field2 :: Int
  }

Now, what data structure is suitable for processing (filter/map/fold) over a collection of these Sample values? Don Stewart has answered here that the Sample type should not be treated as a list ([Sample]) but as a Vector. My question is: how does representing it as a Vector solve the problem? Wouldn't representing the file contents as a vector of Sample values also occupy around 500 GB?

What is the recommended approach to this kind of problem?

Upvotes: 2

Views: 151

Answers (1)

Zeta

Reputation: 105876

As far as I can see, the operations you want (filter, map, and fold) are available in both conduit (see Data.Conduit.List) and pipes (see Pipes.Prelude).

Both libraries are perfectly capable of mapping, filtering, and folding streaming data in constant memory: they process the file one record at a time rather than loading it whole, which is exactly why the 500 GB never has to fit in RAM. Depending on your scenario, they might solve your actual problem.
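For a concrete picture, here is a minimal sketch of the conduit route (using the current runConduit/.| API). The sourceSamples parser is hypothetical, since its implementation depends on how Samples are encoded on disk; the point is that the pipeline only ever holds one Sample in memory at a time:

import Data.Conduit
import qualified Data.Conduit.List as CL

data Sample = Sample
  { field1 :: Int
  , field2 :: Int
  }

-- Hypothetical parser: yields one Sample at a time from the file.
-- Its implementation depends on the on-disk format and is not shown.
sourceSamples :: FilePath -> ConduitT () Sample IO ()
sourceSamples = undefined

-- Sum field2 over every row whose field1 is positive,
-- in constant memory regardless of the file size.
sumSelected :: FilePath -> IO Int
sumSelected path =
  runConduit $
       sourceSamples path
    .| CL.filter (\s -> field1 s > 0)
    .| CL.map field2
    .| CL.fold (+) 0

An equivalent pipeline can be written with Pipes.Prelude's filter, map, and fold.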

If, however, you need to inspect values several times, you're better off loading chunks into a vector, as @Don said.
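Here is a minimal sketch of that chunked approach, assuming a hypothetical readChunk that parses up to a given number of Samples from a handle and returns an empty vector at end of file:

import qualified Data.Vector as V
import System.IO

data Sample = Sample
  { field1 :: Int
  , field2 :: Int
  }

-- Hypothetical chunk reader: parses up to n Samples from the handle,
-- returning an empty vector once the end of the file is reached.
readChunk :: Handle -> Int -> IO (V.Vector Sample)
readChunk = undefined

-- Walk the file chunk by chunk. Each chunk is an ordinary Vector, so
-- it can be traversed as many times as needed before moving on, while
-- memory usage stays bounded by the chunk size, not the file size.
withChunks :: FilePath -> Int -> (V.Vector Sample -> IO ()) -> IO ()
withChunks path chunkSize process =
  withFile path ReadMode $ \h ->
    let go = do
          chunk <- readChunk h chunkSize
          if V.null chunk
            then return ()
            else process chunk >> go
    in go

Since both fields are Ints, an unboxed or storable vector would make each chunk more compact still, at the cost of providing an Unbox or Storable instance for Sample.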

Upvotes: 3
