Reputation: 105
What is the best way to read a huge file (around 1 TB) in Haskell? Basically, the file contains a matrix of integer data, and I may need to (efficiently) calculate the correlation between different rows or between columns.
I have previously used PyTables for this but was thinking of trying the same in Haskell. I know Haskell has some HDF5 bindings, but are there any other options that I am not aware of?
Upvotes: 8
Views: 1354
Reputation: 63359
You could also give mmap a try. For example, you can map a whole file as a ByteString:
import qualified Data.ByteString as B
import System.IO.MMap

main :: IO ()
main = do
    -- map the whole file; pages are loaded on demand
    bs <- mmapFileByteString "myLargeFile" Nothing
    let l = B.length bs
    print l
    -- print the last 1024 bytes:
    let bs2 = B.drop (l - 1024) bs
    print (B.unpack bs2)
Cutting a piece out of it is fast: no data is copied. Then you can use whatever tool you like to parse the ByteStrings.
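If you only need part of the file, mmapFileByteString also accepts an explicit (offset, size) range, so you can map just a window instead of the whole thing. A minimal sketch, assuming a row-major matrix; rowBytes and rowIndex are placeholder values:

import qualified Data.ByteString as B
import System.IO.MMap

main :: IO ()
main = do
    let rowBytes = 4096  -- assumed size of one matrix row in bytes
        rowIndex = 7     -- placeholder row index
    -- map only the requested row, not the whole file
    row <- mmapFileByteString "myLargeFile"
                              (Just (fromIntegral (rowIndex * rowBytes), rowBytes))
    print (B.length row)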
Upvotes: 11
Reputation: 13876
Consider the iteratee package. It supports seeking, and the attoparsec-iteratee package provides integration with attoparsec.
The hSeek + hGet approach Roman suggested is low-level. iteratee is a higher-level approach, but it may be harder for beginners.
Upvotes: 4
Reputation: 38708
As in any other language: you seek (using System.IO.hSeek) and then use binary I/O (Data.ByteString.hGet). Then you parse the result (e.g. using attoparsec) and process it as needed.
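A minimal sketch of that approach (the file name, offset, and chunk size are placeholders, not from the question):

import qualified Data.ByteString as B
import System.IO

-- read nBytes starting at the given byte offset
readChunk :: FilePath -> Integer -> Int -> IO B.ByteString
readChunk path offset nBytes =
    withFile path ReadMode $ \h -> do
        hSeek h AbsoluteSeek offset
        B.hGet h nBytes

main :: IO ()
main = do
    -- e.g. read 4 KiB starting 1 MiB into the file
    chunk <- readChunk "matrix.bin" (1024 * 1024) 4096
    print (B.length chunk)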
Upvotes: 13