Reputation: 4263
I have an input file like this:
a
1,100
2,200
3,300
b
1,100,200
2,200,300
3,300,400
c
...
I want to read the file into multiple data frames, with code like this (to simplify the problem, we assume the number of rows for each table is fixed):
import pandas as pd

with open("file.csv", "r") as f:
    while True:
        table_name = f.readline()
        if table_name:
            table_df = pd.read_csv(f, nrows=3)
            # Do other stuff
        else:
            break
My initial expectation was that pd.read_csv(f, nrows=3) would consume only a limited number of rows from the input stream, and the next f.readline() call would continue from there. However, it turns out that after the first read_csv call, the stream position of f is set to the end of the file, and I can no longer read from the same stream f. My pandas version is 0.25.0. Is this a bug or expected behaviour? Is there any way to reuse the same input stream to read multiple data frames?
Upvotes: 0
Views: 865
Reputation: 4418
Don't know why I didn't think of this earlier. Setting iterator=True makes pd.read_csv return an iterator over the CSV file. Then use get_chunk() to choose how many rows to read at a time:
reader = pd.read_csv(f, iterator=True)

reader.get_chunk(2)

returns ->    i  j  k
           0  1  2  4
           1  2  4  8

reader.get_chunk(3)

returns ->    i  j   k
           2  3  6  12
           3  4  8  16
           4  5  10  20
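For a self-contained check, the same calls can be run against an in-memory buffer (a sketch; io.StringIO stands in for the real file handle, with data matching the output above):

```python
import io
import pandas as pd

# iterator=True returns a TextFileReader; get_chunk(n) reads the
# next n rows and the row index keeps running across chunks.
data = io.StringIO("i,j,k\n1,2,4\n2,4,8\n3,6,12\n4,8,16\n5,10,20\n")

reader = pd.read_csv(data, iterator=True)
first = reader.get_chunk(2)   # rows with index 0-1
second = reader.get_chunk(3)  # rows with index 2-4
```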
Upvotes: 0
Reputation: 4418
Use the csv module from the Python standard library and use send() to tell a generator function how many rows you want:
import csv
import pandas as pd

def csvreader(filename):
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)
        count = yield
        while True:
            rows = []
            for n, row in enumerate(reader):
                rows.append(row)
                if n == count - 1:  # stop after exactly `count` rows
                    break
            count = yield pd.DataFrame(rows)
testfile.csv:
i, j, k
1, 2, 4
2, 4, 8
3, 6, 12
4, 8, 16
. . .
Set up the generator
x = csvreader('testfile.csv')
next(x)
Request next 2 rows:
x.send(2)
#returned DataFrame
i j k
0 1 2 4
1 2 4 8
Request next 3 rows:
x.send(3)
#returned DataFrame
i j k
0 3 6 12
1 4 8 16
2 5 10 20
Note the index starts over each time. This could be fixed by specifying a column as the index (add a running counter to each row if needed):
count = yield pd.DataFrame(rows).set_index(<some column name>)
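A minimal sketch of that fix, keeping a running counter and passing it as the index (the counter and the name csvreader_indexed are my own additions, not part of the answer above):

```python
import csv
import io
import pandas as pd

# Variant of the generator that keeps a running row counter, so each
# returned DataFrame continues the index instead of restarting at 0.
def csvreader_indexed(csvfile):
    reader = csv.DictReader(csvfile)
    count = yield
    counter = 0
    while True:
        rows, idx = [], []
        for row in reader:
            rows.append(row)
            idx.append(counter)
            counter += 1
            if len(rows) == count:
                break
        count = yield pd.DataFrame(rows, index=idx)

f = io.StringIO("i,j,k\n1,2,4\n2,4,8\n3,6,12\n4,8,16\n5,10,20\n")
g = csvreader_indexed(f)
next(g)          # prime the generator
df1 = g.send(2)  # index 0-1
df2 = g.send(3)  # index 2-4
```

Note that csv.DictReader yields strings, so the values are not converted to numbers the way pd.read_csv would convert them.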
Upvotes: 1
Reputation: 92854
pandas.read_csv will create a file reader object from the filepath_or_buffer argument at once, and the nrows= param just gives the ability to take a slice from that reader (it cannot re-instantiate a new reader from the same file object):

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
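For example, any in-memory buffer with a read() method is accepted the same way as an on-disk file:

```python
import io
import pandas as pd

# io.StringIO satisfies the file-like contract (it has a read() method),
# so it can be passed as filepath_or_buffer.
df = pd.read_csv(io.StringIO("x,y\n1,2\n3,4\n"))
```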
Based on your input file format, let's suppose that a table_name is a row containing a single string without the separator , (i.e. a, b).
You can achieve the needed result by manually passing a slice of rows to the read_csv constructor:
import pandas as pd
import io
from itertools import islice

with open("file.csv", "r") as f:
    dfs = []
    while True:
        table_name = f.readline().strip()
        if table_name and ',' not in table_name:
            data = ''.join(islice(f, 3)).strip()
            table_df = pd.read_csv(io.StringIO(data), sep=',', header=None)
            dfs.append([table_name, table_df])
        else:
            break

# check results
for t_name, df in dfs:
    print('---', t_name)
    print(df)
Sample output:
--- a
0 1
0 1 100
1 2 200
2 3 300
--- b
0 1 2
0 1 100 200
1 2 200 300
2 3 300 400
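The same loop can be checked without a file on disk by feeding the sample input from the question through io.StringIO (a sketch):

```python
import io
from itertools import islice
import pandas as pd

# The multi-table input from the question, held in memory.
raw = "a\n1,100\n2,200\n3,300\nb\n1,100,200\n2,200,300\n3,300,400\n"
f = io.StringIO(raw)

dfs = []
while True:
    table_name = f.readline().strip()
    if table_name and ',' not in table_name:
        # Take the next 3 lines off the stream and parse only those.
        data = ''.join(islice(f, 3)).strip()
        dfs.append([table_name, pd.read_csv(io.StringIO(data), sep=',', header=None)])
    else:
        break
```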
Upvotes: 3