Reputation: 695
Essentially, I've got 8 gigabytes of CSV data and I want to shuffle it randomly so that I can train my ML model on mini-batches. However, if I load all 8 GB straight into Python and shuffle it, I run into a memory problem.
But if I load the data chunk by chunk and shuffle each chunk, the data still follows its original pattern, since it was sorted to begin with. This is what I've done so far:
import pandas as pd
import numpy as np
# read one chunk with CHUNK_SIZE rows
reader = pd.read_csv(path, header=0, iterator=True)
data = reader.get_chunk(CHUNK_SIZE)
# randomly shuffle the rows of this chunk
data = data.iloc[np.random.permutation(len(data))]
Is there a way to do this quickly and efficiently? Thank you.
UPDATE: I have approximately 30,000,000 rows, and the data is sorted by time.
Upvotes: 1
Views: 350
Reputation: 207678
Here's a concept...
Generate a 30,000,000-line CSV with Perl - takes 11 seconds on my Mac:
perl -E 'for($i=0;$i<30000000;$i++){say "Line $i,field2,field3,",int rand 100}' > BigBoy.csv
Sample Output
Line 0,field2,field3,49
Line 1,field2,field3,6
Line 2,field2,field3,15
...
Line 29999998,field2,field3,79
Line 29999999,field2,field3,19
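If Perl isn't available, a roughly equivalent test file can be generated from Python (just a sketch, and noticeably slower than the one-liner above):
import random
# write a 30,000,000-line test CSV in the same shape as the Perl output above
with open("BigBoy.csv", "w") as f:
    for i in range(30_000_000):
        f.write(f"Line {i},field2,field3,{random.randrange(100)}\n")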
Take 1% of the lines and shuffle them - takes 3 seconds and 15MB of RAM:
awk 'rand()>0.99' BigBoy.csv | gshuf > RandomSet.csv
RandomSet.csv contains 299,748 lines:
Sample Output
Line 15348259,field2,field3,95
Line 1642442,field2,field3,93
Line 29199452,field2,field3,52
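If you would rather stay in Python, the same idea translates directly: stream the file, keep each line with probability 0.01, and shuffle the small surviving sample in memory. This is a sketch of that approach, not an exact equivalent of the awk/gshuf pipeline:
import random
# keep each line with probability 0.01 (~1% sample) while streaming,
# so only the sample ever sits in memory
sample = []
with open("BigBoy.csv") as f:
    for line in f:
        if random.random() < 0.01:
            sample.append(line)
# the ~300,000 surviving lines fit comfortably in RAM, so shuffle them there
random.shuffle(sample)
with open("RandomSet.csv", "w") as out:
    out.writelines(sample)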
gshuf was installed on the Mac using Homebrew:
brew install coreutils
Upvotes: 2