Reputation: 408
Could you help me? I have a problem reading random rows from a large CSV file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).
In Read a small random sample from a big CSV file into a Python data frame I saw an approach; however, it turned out to be very memory-consuming on my PC, namely this part of the code:
import random as rnd
import pandas as pd

n = 100
s = 10
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)
So if I want to take random rows from the file, considering not just 100 rows but 100 000, it becomes hard; however, taking non-random rows from the file works almost fine:
skiprows = xrange(100000)  # skip the first 100 000 rows
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'},
                   skiprows=skiprows, nrows=10000)
So the question is: how can I read a large number of random rows from a large CSV file with pandas? Since I can't read the entire CSV file, even in chunks, I'm interested in exactly the random rows. Thanks
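For clarity, the scaled-up version I have in mind looks roughly like this (a sketch only; the first pass to count rows and the sample size s = 100 000 are assumptions):

with open(path) as f:
    n = sum(1 for _ in f) - 1                        # total data rows, minus the header (assumed)
s = 100000                                           # desired sample size (assumed)
skip = sorted(rnd.sample(xrange(1, n + 1), n - s))   # rows to skip; row 0 is the header
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)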
Upvotes: 3
Views: 4709
Reputation: 5222
I think this is faster than the other methods shown here and may be worth trying.
Say we have already chosen the rows to be skipped in a list skipped. First, I convert it to a boolean lookup table.
import numpy as np

# Some preparation:
skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True
Main stuff:
from io import StringIO
# in Python 2 use
# from StringIO import StringIO
import pandas as pd

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue
            # buffer only the lines that are not skipped
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df
I tested it as follows:
df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
with 90% of rows skipped. It performs comparably to
pd.read_csv('test.csv', skiprows=skipped, index_col=0)
and is about 3-4 times faster than using dask or reading in chunks.
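For completeness, the skipped list and bool_skipped mask for that 90% test could be built along these lines (the exact construction is not shown above, so the names and numbers here are assumptions):

import numpy as np

MAX = 100001  # 100 000 data rows + 1 header line written by to_csv (assumed)
# skip 90% of the data-row line numbers 1..100000; line 0 is the header
skipped = np.random.choice(np.arange(1, MAX), size=90000, replace=False)
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True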
Upvotes: 1
Reputation: 6519
If memory is the biggest issue, a possible solution might be to use chunks and randomly select from them:
import random
import pandas as pd

n = 100
s = 10
factor = 1  # should be an integer
chunksize = int(s / factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'},
                     chunksize=chunksize)
out = []
tot = 0
for df in reader:
    # draw a random number of rows from this chunk
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        # trim the last draw so the total is exactly s
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break
data = pd.concat(out)
And you can use factor to control the size of the chunks.
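For instance (the numbers here are an illustration, not from the answer), at the question's scale of s = 100 000 sampled rows one might pick:

s = 100000
factor = 10
chunksize = int(s / factor)   # 10 000 rows per chunk
# each iteration then holds at most 10 000 rows of the file in memory
# and draws between 10 and 10 000 of them via random.randint(factor, chunksize)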
Upvotes: 2