Reputation: 289
I have a data file of 60k lines, where each line has ~1k comma-separated Ints (which I want to immediately turn into Doubles).
I want to iterate over a sequence of random "batches" of 32 lines, where a batch is a random subset of all of the lines and no two batches share a line. Since there are 60k lines and 32 lines per batch, there should be 1875 batches.
I'm open to changing things if necessary, but I'd like them to be in the form of a list (of batches) that's lazily evaluated. The code that needs this is a foldM, which I'm using like:

resulting_struct <- foldM fold_fn my_struct batch_list

so that it repeatedly calls fold_fn on the current accumulator my_struct and the next element of batch_list.
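(For reference, foldM from Control.Monad has roughly the following type, depending on the base version, so batch_list only ever needs to be consumed one element at a time:)

foldM :: (Foldable t, Monad m) => (b -> a -> m b) -> b -> t a -> m b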
I'm very confused. It was easy when I didn't need to shuffle them; I simply read them in and chunked them, and they were evaluated lazily, so I had no problems. Now I'm completely stuck and feel like I must be missing something simple.
I've tried the following:
Reading the file into a list of lines and naively shuffling the input. This doesn't work: readFile is lazily evaluated, but shuffling needs the whole file in memory at once, and it quickly eats up all of my ~8 GB of RAM.
Getting the length of the file, and then creating a list of batches of shuffled indices from 0 to 60k that correspond to the line numbers that will be selected to form the batches. Then, when I want to actually get the data batches, I do:
ind_batches <- get_shuffled_ind_batches_from_file fname
batch_list <- mapM (get_data_batch_from_ind_batch fname) ind_batches
where:
get_shuffled_ind_batches_from_file :: String -> IO [[Int]]
get_shuffled_ind_batches_from_file fname = do
    contents <- get_contents_from_file fname -- uses readFile, returns [[Double]]
    let n_samps = length contents
        ind     = [0..(n_samps-1)]
    shuffled_indices <- shuffle_list ind
    let shuffled_ind_chunks = take 1800 $ chunksOf 32 shuffled_indices
    return shuffled_ind_chunks
get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = do
    contents <- get_contents_from_file fname
    let data_batch = get_elems_at_indices contents ind_chunk
    return data_batch
-- imports needed by this helper:
import Control.Monad (forM)
import Data.Array.IO (IOArray, newListArray, readArray, writeArray)
import System.Random (randomRIO)

shuffle_list :: [a] -> IO [a]
shuffle_list xs = do
    ar <- newArray n xs
    forM [1..n] $ \i -> do
        j  <- randomRIO (i, n)
        vi <- readArray ar i
        vj <- readArray ar j
        writeArray ar j vi
        return vj
  where
    n = length xs
    newArray :: Int -> [a] -> IO (IOArray Int a)
    newArray n xs = newListArray (1, n) xs
get_elems_at_indices :: [a] -> [Int] -> [a]
get_elems_at_indices my_list ind_list = (map . (!!)) my_list ind_list
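(As an aside, I realize get_elems_at_indices walks the list from the head for every index. If the contents are going to sit in memory anyway, building a lookup structure once and reusing it across batches would at least avoid that; a hypothetical sketch using Data.IntMap from the containers package, not code I actually ran:)

import qualified Data.IntMap.Strict as IM

-- Build this once from the file contents and reuse it for every batch;
-- each lookup is then O(log n) rather than walking the list with (!!).
index_contents :: [a] -> IM.IntMap a
index_contents = IM.fromList . zip [0..]

get_elems_at_indices' :: IM.IntMap a -> [Int] -> [a]
get_elems_at_indices' m ind_list = map (m IM.!) ind_list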
However, it seems like mapM evaluates immediately, which then tries to read in the file contents repeatedly (I think; the RAM blows up either way).
I then found unsafeInterleaveIO, which makes an action lazily evaluated, so I tried sticking it in like so:

get_data_batch_from_ind_batch :: String -> [Int] -> IO [[Double]]
get_data_batch_from_ind_batch fname ind_chunk = unsafeInterleaveIO $ do
    contents <- get_contents_from_file fname
    let data_batch = get_elems_at_indices contents ind_chunk
    return data_batch

but no luck: same problem as above.
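(For reference, my understanding of the trick: unsafeInterleaveIO defers an IO action until its result is demanded, so a lazy mapM built from it looks roughly like the sketch below, with my own naming. In my case each deferred action still re-reads and re-indexes the whole file, so it didn't help.)

import System.IO.Unsafe (unsafeInterleaveIO)

-- A lazy mapM: each action runs only when its element of the result
-- list is actually demanded (e.g. when foldM reaches it).
lazy_mapM :: (a -> IO b) -> [a] -> IO [b]
lazy_mapM _ []     = return []
lazy_mapM f (x:xs) = unsafeInterleaveIO $ do
    y  <- f x
    ys <- lazy_mapM f xs
    return (y : ys)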
I feel like I've been banging my head against the wall here and must be missing something very simple. Someone suggested using streams or conduits instead, but when I looked at the documentation for them, it wasn't really clear to me how I could use them to solve this problem.
How can I read in a large data file and also shuffle it, without using up all my memory?
Upvotes: 0
Views: 95
Reputation: 91907
hGetContents will return the contents of the file lazily, but if you do much of anything with the result you will realize the whole file at once. I suggest reading the file once and scanning over it for newlines, so that you can build an index of which chunk starts at which byte offset. That index will be quite small, so you can shuffle it easily. Then you can iterate through the index, each time opening the file, reading only a defined sub-range of it, and parsing only that one chunk.
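A rough sketch of the idea (reusing shuffle_list from your question; all other names are mine, and error handling is omitted):

import qualified Data.ByteString.Char8 as BS
import Data.List.Split (chunksOf)            -- from the "split" package
import System.IO
import System.IO.Unsafe (unsafeInterleaveIO)

-- One pass over the file, keeping only the byte offset at which each
-- line starts. 60k offsets is a tiny index compared to the data itself.
lineOffsets :: FilePath -> IO [Integer]
lineOffsets fname = withFile fname ReadMode (go 0 [])
  where
    -- assumes '\n' line endings; the +1 accounts for the newline
    go off acc h = do
        eof <- hIsEOF h
        if eof
            then return (reverse acc)
            else do
                line <- BS.hGetLine h
                let next = off + fromIntegral (BS.length line) + 1
                next `seq` go next (off : acc) h

-- Seek to one offset, then read and parse just that one line.
readLineAt :: FilePath -> Integer -> IO [Double]
readLineAt fname off = withFile fname ReadMode $ \h -> do
    hSeek h AbsoluteSeek off
    line <- BS.hGetLine h
    return [ read (BS.unpack field) | field <- BS.split ',' line ]

-- Shuffle the index, chunk it into batches of 32, and defer each batch
-- (with the unsafeInterleaveIO you were already trying) until foldM
-- demands it; each deferred action now reads only its own 32 lines.
shuffledBatches :: FilePath -> IO [[[Double]]]
shuffledBatches fname = do
    offs     <- lineOffsets fname
    shuffled <- shuffle_list offs
    mapM (unsafeInterleaveIO . mapM (readLineAt fname))
         (chunksOf 32 shuffled)

With that, batch_list <- shuffledBatches fname should plug straight into your foldM, and memory use stays bounded by one batch at a time.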
Upvotes: 1