Abhijit Ray

Reputation: 105

Random access on a huge file in Haskell

What is the best way to read a huge file (around 1 TB) in Haskell? The file contains a matrix of integer data, and I may need to (efficiently) calculate the correlation between different rows or between columns.

I have previously used PyTables for this but was thinking of trying the same in Haskell. I know Haskell has some HDF5 bindings, but are there any other options I am not aware of?

Upvotes: 8

Views: 1354

Answers (3)

Petr

Reputation: 63359

You could also give mmap a try. For example, you can map a whole file as a ByteString:

import qualified Data.ByteString as B
import System.IO.MMap

main :: IO ()
main = do
    -- map the whole file into the address space; Nothing means no offset/length range
    bs <- mmapFileByteString "myLargeFile" Nothing
    let l = B.length bs
    print l
    -- print last 1024 bytes:
    let bs2 = B.drop (l - 1024) bs
    print (B.unpack bs2)

Cutting a piece out of it is fast: no data is copied. Then you can use whatever tool you like to parse the ByteString.
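For illustration, here is a minimal sketch (not part of the answer above) of cutting one row out of the mapped file and parsing it, assuming a hypothetical layout where every row occupies exactly rowBytes bytes of ASCII integers; rowBytes, readRow and the row index are made up for the example:

import qualified Data.ByteString.Char8 as C
import Data.Maybe (mapMaybe)
import System.IO.MMap (mmapFileByteString)

-- Hypothetical layout: every row occupies exactly rowBytes bytes of ASCII integers.
rowBytes :: Int
rowBytes = 4096

-- Slice row i out of the mapped file (shares the buffer, no copying) and parse it.
readRow :: C.ByteString -> Int -> [Int]
readRow bs i = mapMaybe (fmap fst . C.readInt)
                        (C.words (C.take rowBytes (C.drop (i * rowBytes) bs)))

main :: IO ()
main = do
    bs <- mmapFileByteString "myLargeFile" Nothing
    print (readRow bs 7)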

Upvotes: 11

Yuras

Reputation: 13876

Consider the iteratee package. It supports seeking, and the attoparsec-iteratee package provides integration with attoparsec.

The hSeek + hGet approach Roman suggested is low level. iteratee is a higher-level approach, but it may be harder for beginners.

Upvotes: 4

Roman Cheplyaka

Reputation: 38708

As in any other language: you seek (using System.IO.hSeek), and then use binary IO (Data.ByteString.hGet). Then you parse the result (e.g. using attoparsec) and process as needed.
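A minimal sketch of that hSeek + hGet + attoparsec pipeline, assuming a hypothetical fixed-width layout where each row is exactly rowBytes bytes of ASCII integers (rowBytes and readRow are made up for the example):

import qualified Data.ByteString as B
import Data.Attoparsec.ByteString.Char8 (decimal, parseOnly, sepBy, signed, skipSpace)
import System.IO (IOMode(ReadMode), SeekMode(AbsoluteSeek), hSeek, withBinaryFile)

-- Hypothetical layout: each row is exactly rowBytes bytes of ASCII integers.
rowBytes :: Int
rowBytes = 4096

-- Seek to the i-th row, read only that chunk, and parse its integers.
readRow :: FilePath -> Int -> IO (Either String [Int])
readRow path i =
    withBinaryFile path ReadMode $ \h -> do
        hSeek h AbsoluteSeek (fromIntegral (i * rowBytes))
        chunk <- B.hGet h rowBytes
        return (parseOnly (signed decimal `sepBy` skipSpace) chunk)

main :: IO ()
main = readRow "myLargeFile" 7 >>= print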

Upvotes: 13
