Reputation: 4263
I have an input file like this:
a
1,100
2,200
3,300
b
1,100,200
2,200,300
3,300,400
c
...
I want to read the file into multiple data frames, with code like this (to simplify the problem, we assume the number of rows for each table is fixed):
import pandas as pd

with open("file.csv", "r") as f:
    while True:
        table_name = f.readline()
        if table_name:
            table_df = pd.read_csv(f, nrows=3)
            # Do other stuff
        else:
            break
My initial expectation was that pd.read_csv(f, nrows=3) would consume only a limited number of rows from the input stream, and the next f.readline() call would continue from there. However, it turns out that after the first read_csv call, the stream position of f is set to the end of the file, and I can no longer read from the same stream f. My pandas version is 0.25.0. Is this a bug or expected behaviour? Is there any way to reuse the same input stream to read multiple data frames?
Upvotes: 0
Views: 865
Reputation: 4418
Don't know why I didn't think of this earlier. Setting iterator=True makes pd.read_csv return an iterator over the CSV file. Then use get_chunk() to choose how many rows to read at a time:
reader = pd.read_csv(f, iterator=True)

reader.get_chunk(2)

returns ->    i  j  k
           0  1  2  4
           1  2  4  8

reader.get_chunk(3)

returns ->    i  j   k
           2  3  6  12
           3  4  8  16
           4  5  10  20
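For a self-contained check, the same calls can be run against an in-memory buffer (a sketch; io.StringIO stands in for the real file handle, with data matching the output above):

```python
import io
import pandas as pd

# iterator=True returns a TextFileReader; get_chunk(n) reads the
# next n rows and the row index keeps running across chunks.
data = io.StringIO("i,j,k\n1,2,4\n2,4,8\n3,6,12\n4,8,16\n5,10,20\n")

reader = pd.read_csv(data, iterator=True)
first = reader.get_chunk(2)   # rows with index 0-1
second = reader.get_chunk(3)  # rows with index 2-4
```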
Upvotes: 0
Reputation: 4418
Use the csv module from the Python standard library and use send() to tell a generator function how many rows you want:
import csv
import pandas as pd

def csvreader(filename):
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        count = yield
        while True:
            rows = []
            for n, row in enumerate(reader):
                rows.append(row)
                if n == count - 1:  # stop after exactly `count` rows
                    break
            count = yield pd.DataFrame(rows)
testfile.csv:
i, j, k
1, 2, 4
2, 4, 8
3, 6, 12
4, 8, 16
. . .
Set up the generator
x = csvreader('testfile.csv')
next(x)
Request next 2 rows:
x.send(2)
#returned DataFrame
i j k
0 1 2 4
1 2 4 8
Request next 3 rows:
x.send(3)
#returned DataFrame
i j k
0 3 6 12
1 4 8 16
2 5 10 20
Note the index starts over each time. This could be fixed by specifying a column as the index (add a running counter to each row if needed):
count = yield pd.DataFrame(rows).set_index(<some column name>)
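A minimal sketch of that fix, keeping a running counter and passing it as the index (the counter and the name csvreader_indexed are my own additions, not part of the answer above):

```python
import csv
import io
import pandas as pd

# Variant of the generator that keeps a running row counter, so each
# returned DataFrame continues the index instead of restarting at 0.
def csvreader_indexed(csvfile):
    reader = csv.DictReader(csvfile)
    count = yield
    counter = 0
    while True:
        rows, idx = [], []
        for row in reader:
            rows.append(row)
            idx.append(counter)
            counter += 1
            if len(rows) == count:
                break
        count = yield pd.DataFrame(rows, index=idx)

f = io.StringIO("i,j,k\n1,2,4\n2,4,8\n3,6,12\n4,8,16\n5,10,20\n")
g = csvreader_indexed(f)
next(g)          # prime the generator
df1 = g.send(2)  # index 0-1
df2 = g.send(3)  # index 2-4
```

Note that csv.DictReader yields strings, so the values are not converted to numbers the way pd.read_csv would convert them.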
Upvotes: 1
Reputation: 92854
pandas.read_csv will create a file reader object from the filepath_or_buffer argument at once, and the nrows= param just gives the ability to take a slice from that reader (it cannot re-instantiate a new reader from the same file object):

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
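For example, any in-memory buffer with a read() method is accepted the same way as an on-disk file:

```python
import io
import pandas as pd

# io.StringIO satisfies the file-like contract (it has a read() method),
# so it can be passed as filepath_or_buffer.
df = pd.read_csv(io.StringIO("x,y\n1,2\n3,4\n"))
```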
Based on your input file format, let's suppose that a table_name is a row containing a single string without the separator , (i.e. a, b).
You can achieve the needed result by manually passing a slice of rows to the read_csv constructor:
import pandas as pd
import io
from itertools import islice

with open("file.csv", "r") as f:
    dfs = []
    while True:
        table_name = f.readline().strip()
        if table_name and ',' not in table_name:
            data = ''.join(islice(f, 3)).strip()
            table_df = pd.read_csv(io.StringIO(data), sep=',', header=None)
            dfs.append([table_name, table_df])
        else:
            break

# check results
for t_name, df in dfs:
    print('---', t_name)
    print(df)
Sample output:
--- a
0 1
0 1 100
1 2 200
2 3 300
--- b
0 1 2
0 1 100 200
1 2 200 300
2 3 300 400
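The same loop can be checked without a file on disk by feeding the sample input from the question through io.StringIO (a sketch):

```python
import io
from itertools import islice
import pandas as pd

# The multi-table input from the question, held in memory.
raw = "a\n1,100\n2,200\n3,300\nb\n1,100,200\n2,200,300\n3,300,400\n"
f = io.StringIO(raw)

dfs = []
while True:
    table_name = f.readline().strip()
    if table_name and ',' not in table_name:
        # Take the next 3 lines off the stream and parse only those.
        data = ''.join(islice(f, 3)).strip()
        dfs.append([table_name, pd.read_csv(io.StringIO(data), sep=',', header=None)])
    else:
        break
```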
Upvotes: 3