Reputation: 695
Essentially, I've got 8 gigabytes of CSV data and I want to shuffle it randomly so that I can train my ML model on mini-batches. However, if I load all 8 GB straight into Python and shuffle it, I run into a memory problem.
But if I load the data chunk by chunk and shuffle each chunk, the data still follows its original pattern, since it was sorted to begin with. This is what I've done so far:
import pandas as pd
import numpy as np
# read one chunk with CHUNK_SIZE rows
reader = pd.read_csv(path, header=0, iterator=True)
data = reader.get_chunk(CHUNK_SIZE)
# randomly shuffle the rows of this chunk
data = data.iloc[np.random.permutation(len(data))]
Is there a way to do this quickly and efficiently? Thank you.
UPDATE: I have approximately 30,000,000 rows, and the data is sorted by time.
Upvotes: 1
Views: 350
Reputation: 207678
Here's a concept...
Generate a 30,000,000-line CSV with Perl - takes 11 seconds on my Mac:
perl -E 'for($i=0;$i<30000000;$i++){say "Line $i,field2,field3,",int rand 100}' > BigBoy.csv
Sample Output
Line 0,field2,field3,49
Line 1,field2,field3,6
Line 2,field2,field3,15
...
Line 29999998,field2,field3,79
Line 29999999,field2,field3,19
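If Perl isn't available, a roughly equivalent test file can be generated from Python (just a sketch, and noticeably slower than the one-liner above):
import random
# write a 30,000,000-line test CSV in the same shape as the Perl output above
with open("BigBoy.csv", "w") as f:
    for i in range(30_000_000):
        f.write(f"Line {i},field2,field3,{random.randrange(100)}\n")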
Take 1% of the lines and shuffle them - takes 3 seconds and 15MB of RAM:
awk 'rand()>0.99' BigBoy.csv | gshuf > RandomSet.csv
RandomSet.csv contains 299,748 lines:
Sample Output
Line 15348259,field2,field3,95
Line 1642442,field2,field3,93
Line 29199452,field2,field3,52
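If you would rather stay in Python, the same idea translates directly: stream the file, keep each line with probability 0.01, and shuffle the small surviving sample in memory. This is a sketch of that approach, not an exact equivalent of the awk/gshuf pipeline:
import random
# keep each line with probability 0.01 (~1% sample) while streaming,
# so only the sample ever sits in memory
sample = []
with open("BigBoy.csv") as f:
    for line in f:
        if random.random() < 0.01:
            sample.append(line)
# the ~300,000 surviving lines fit comfortably in RAM, so shuffle them there
random.shuffle(sample)
with open("RandomSet.csv", "w") as out:
    out.writelines(sample)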
gshuf was installed on the Mac using Homebrew:
brew install coreutils
Upvotes: 2