mischva11

Reputation: 2956

Reading specific chunks pandas / not reading all chunks in pandas

Following this question and answer, I am trying to read a large CSV file in chunks and process it. Since I'm not a native Python user, I ran into an optimization problem and am looking for a better solution here.


What my code does:

I read in the line count of my csv with

with open(file) as f:
    row_count = sum(1 for line in f)

Afterwards I "slice" my data into 30 equally sized chunks and process them, following the linked answer, with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one figure is pretty unclear, I only plot every 5th chunk using modulo (the factor may be varied). For this I use an external counter.

chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    if (counter % 5 == 0) or (counter == 0):
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()

Now to my question:

It seems like this loop reads each chunk in before processing it, which is reasonable. I can see this because the print(counter) steps are also fairly slow. Since I read a few million rows of CSV, every step takes some time. Is there a way to skip the unwanted chunks in the for loop before reading them in? I was trying out something like:

wanted_plts = [1, 5, 10, 15, 20, 25, 30]
for i in wanted_plts:
    for chunk[i] in pd.read_csv(file, chunksize=chunksize):
        ...

I think I have some misunderstanding of how to manipulate the syntax of this for loop. There should be an elegant way to fix this.

Also: I found pandas' .get_chunk(x), but it seems to create just one chunk of size x.
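For illustration, this is how I understand it (a sketch; reader is my own name, file and chunksize are from above):

import pandas as pd

# With iterator=True, read_csv returns a TextFileReader, and each
# get_chunk(x) call parses and returns just the next x rows.
reader = pd.read_csv(file, iterator=True)
first = reader.get_chunk(chunksize)   # rows 0 .. chunksize-1
second = reader.get_chunk(chunksize)  # the next chunksize rows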

Another attempt was to subset the reader object of pd.read_csv, like pd.read_csv(...)[0,1,2], but it seems that's not possible either.
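The closest workaround I found is stepping through the reader with itertools.islice, but as far as I can tell this still parses every skipped chunk under the hood, so it saves no reading time:

from itertools import islice
import pandas as pd

reader = pd.read_csv(file, chunksize=chunksize)
for chunk in islice(reader, 0, None, 5):  # yields chunks 0, 5, 10, ...
    plt.plot(chunk["Variable"])
plt.show()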


Amendment: I'm aware that plotting a lot of data in matplotlib is slow. I preprocess the data earlier, but to keep this code readable I removed all unnecessary parts.

Upvotes: 4

Views: 6287

Answers (3)

igrinis

Reputation: 13656

You are wasting a lot of resources parsing the CSV into DataFrames without using them. To avoid this, you can create a line index during the first pass:

fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
    row_count += 1
    pos[row_count] = fp.tell()  # pos[k] holds the byte offset where line k starts
    line = fp.readline()

Do not dispose of the file handle yet! Because read_csv() accepts streams, you can move the file pointer wherever you want:

chunksize = row_count // 30
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts: 
    fp.seek(pos[i*chunksize])  # this will bring you to the first line of the desired chunk
    obj = pd.read_csv(fp, chunksize=chunksize)  # read your chunk lazily
    df = obj.get_chunk()  # convert to DataFrame object
    plt.plot(df["Variable"]) # do something  

fp.close()  # Don't forget to close the file when finished.

And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:

obj = pd.read_csv(fp, chunksize=chunksize, names=[<column names you have>])
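If you don't want to hard-code the names, one possible sketch is to read them once from the file's first line (header_names is my own name):

# A sketch: grab the column names from the first line of the file,
# then reuse them for every chunk read from the stream.
with open(file_name) as f:
    header_names = f.readline().rstrip('\n').split(',')

obj = pd.read_csv(fp, chunksize=chunksize, names=header_names)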

P.S. file is a built-in name (in Python 2), so avoid shadowing it to prevent undesired side effects. You can use file_ or file_name instead.

Upvotes: 6

Valentino

Reputation: 7361

As far as I know, pandas does not provide any support for skipping chunks of a file. At least I never found anything about it in the documentation.

In general, skipping lines in a file (without reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters each of those lines contains. In that case you can try to play with IO and seek to move the stream position to the exact place you need for the next iteration.
But that does not seem to be your case.

I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.

Consider, for example, the following custom iterator.
When instantiated, it saves the header (first line). On each iteration it reads a chunk of lines from the file and then skips the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in an io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).

import io
from itertools import islice

class DfReaderChunks:
    def __init__(self, filename, chunksize, n):
        self.fo = open(filename)
        self.chs = chunksize
        self.skiplines = self.chs * n
        self.header = next(self.fo)  # save the header (first line)

    def getchunk(self):
        ll = list(islice(self.fo, self.chs))        # read chunksize lines
        if len(ll) == 0:
            raise StopIteration
        dd = list(islice(self.fo, self.skiplines))  # consume and discard the lines to skip
        return self.header + ''.join(ll)

    def __iter__(self):
        return self

    def __next__(self):
        return io.StringIO(self.getchunk())

    def close(self):
        self.fo.close()

    def __del__(self):
        self.fo.close()

Using this class, you can read from your file:

reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
    df = pd.read_csv(dfst)
    print(df) #here I print to stdout, you can plot
reader.close()

which is "equivalent" to your setup:

counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    if counter % 5 == 0:
        print(df)  # again I print, you can plot
    counter += 1

I tested the time used by both of the above snippets using a 39 MB CSV file (100000 rows of random numbers).

On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.

The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so the index always starts from 0), but the printed chunks are the same.
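If the original row numbers matter, a possible workaround (a sketch, assuming the read/skip pattern of DfReaderChunks above) is to shift each chunk's index by its global starting row:

# A sketch: chunk number c starts at global row c * chunksize * (n + 1),
# because each iteration reads chunksize lines and skips n * chunksize.
n = 4
reader = DfReaderChunks(file, chunksize, n)
for c, dfst in enumerate(reader):
    df = pd.read_csv(dfst)
    df.index += c * chunksize * (n + 1)  # restore global row numbers
    print(df)
reader.close()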

Upvotes: 2

Basile

Reputation: 131

I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.

So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.

Instead of:

for chunk in pd.read_csv(file, chunksize=chunksize):
    plt.plot(chunk['Variable'])

Use:

for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
    plt.plot(chunk)

And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case, dividing row_count by the smallest number of chunks you can).
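As a minimal sketch (reusing row_count from the question), that just means dividing row_count by the smallest chunk count you can live with:

n_chunks = 5  # instead of 30: fewer, bigger chunks, fewer read/plot round trips
chunksize = row_count // n_chunks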

I haven't quantified their respective weights, but you will gain on both the read_csv() and the plot() method overheads, even if only slightly, given that your current chunks are already quite big.

With my test data, quadrupling the chunksize cuts down processing time in half:

chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s

And specifying usecols at read time also cuts down processing time in half again:

chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s

Upvotes: 1
