Reputation: 408
Could you help me? I have a problem reading random rows from a large CSV file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).
In Read a small random sample from a big CSV file into a Python data frame I saw an approach; however, it turned out to be very memory-consuming on my PC, namely this part of the code:
import random as rnd
import pandas as pd

n = 100
s = 10
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)
So if I want to take random rows from the file, considering not just 100 rows but 100 000, it becomes hard; however, taking non-random rows from the file works almost fine:
skiprows = xrange(100000)  # skip the first 100 000 rows
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'},
                   skiprows=skiprows, nrows=10000)
So the question is: how can I read a large number of random rows from a large CSV file with pandas? Since I can't read the entire CSV file, even in chunks, I'm interested in exactly the random rows. Thanks
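For clarity, the scaled-up version I have in mind looks roughly like this (a sketch only; the first pass to count rows and the sample size s = 100 000 are assumptions):

with open(path) as f:
    n = sum(1 for _ in f) - 1                        # total data rows, minus the header (assumed)
s = 100000                                           # desired sample size (assumed)
skip = sorted(rnd.sample(xrange(1, n + 1), n - s))   # rows to skip; row 0 is the header
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)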
Upvotes: 3
Views: 4709
Reputation: 5222
I think this is faster than the other methods shown here and may be worth trying.
Say we have already chosen the rows to be skipped in a list skipped. First, I convert it to a boolean lookup table.
import numpy as np

# Some preparation:
skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True
Main stuff:
from io import StringIO
# in Python 2 use
# from StringIO import StringIO
import pandas as pd

def load_with_buffer(filename, bool_skipped, **kwargs):
    s_buf = StringIO()
    with open(filename) as file:
        count = -1
        for line in file:
            count += 1
            if bool_skipped[count]:
                continue
            # buffer only the lines that are not skipped
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df
I tested it as follows:
df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')
df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)
with 90% of rows skipped. It performs comparably to
pd.read_csv('test.csv', skiprows=skipped, index_col=0)
and is about 3-4 times faster than using dask or reading in chunks.
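For completeness, the skipped list and bool_skipped mask for that 90% test could be built along these lines (the exact construction is not shown above, so the names and numbers here are assumptions):

import numpy as np

MAX = 100001  # 100 000 data rows + 1 header line written by to_csv (assumed)
# skip 90% of the data-row line numbers 1..100000; line 0 is the header
skipped = np.random.choice(np.arange(1, MAX), size=90000, replace=False)
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True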
Upvotes: 1
Reputation: 6519
If memory is the biggest issue, a possible solution might be to use chunks and randomly select from them:
import random
import pandas as pd

n = 100
s = 10
factor = 1  # should be an integer
chunksize = int(s / factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'},
                     chunksize=chunksize)
out = []
tot = 0
for df in reader:
    # draw a random number of rows from this chunk
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        # trim the last draw so the total is exactly s
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break
data = pd.concat(out)
And you can use factor to control the size of the chunks.
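For instance (the numbers here are an illustration, not from the answer), at the question's scale of s = 100 000 sampled rows one might pick:

s = 100000
factor = 10
chunksize = int(s / factor)   # 10 000 rows per chunk
# each iteration then holds at most 10 000 rows of the file in memory
# and draws between 10 and 10 000 of them via random.randint(factor, chunksize)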
Upvotes: 2