Reputation: 4055
Reading a large CSV file using pandas, I want to use chunksize to limit the number of rows read in at a time, but on the second iteration I would like to keep 300 rows from the previous chunk. Is there a way to do this in read_csv?
Upvotes: 0
Views: 697
Reputation: 120469
Use read_csv with the chunksize=XXX parameter. At each iteration, save the last 300 rows for the next iteration and concatenate them with the new XXX rows:
import pandas as pd

chunk_size = 5    # 1000
overlap_size = 3  # 300

data = []
prev_chunk = pd.DataFrame()  # empty on the first iteration

with pd.read_csv('data.csv', chunksize=chunk_size) as reader:
    for i, chunk in enumerate(reader, 1):
        # prepend the saved tail of the previous chunk to the new chunk
        df = pd.concat([prev_chunk, chunk])
        # keep the last overlap_size rows for the next iteration
        prev_chunk = chunk.iloc[-overlap_size:]
        # Do whatever you want with df
        # res = process_data(df)
        # data.append(res)
        print(f'Iteration {i}:\n{df}\n')

# df = pd.concat(data)  # combine the processed pieces once the loop is done
Output:
Iteration 1:
colA
0 line1
1 line2
2 line3
3 line4
4 line5
Iteration 2:
colA
2 line3
3 line4
4 line5
5 line6
6 line7
7 line8
8 line9
9 line10
Iteration 3:
colA
7 line8
8 line9
9 line10
10 line11
11 line12
12 line13
13 line14
14 line15
Iteration 4:
colA
12 line13
13 line14
14 line15
15 line16
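If you need this pattern in more than one place, the overlap bookkeeping can also be wrapped in a small generator. This is just a sketch: the function name overlapping_chunks and the process placeholder are made up, and the default sizes simply mirror the 1000/300 numbers from the question.

import pandas as pd

def overlapping_chunks(path, chunk_size=1000, overlap_size=300):
    """Yield each chunk of the CSV prefixed with the last
    overlap_size rows of the previous chunk."""
    prev_tail = pd.DataFrame()
    with pd.read_csv(path, chunksize=chunk_size) as reader:
        for chunk in reader:
            yield pd.concat([prev_tail, chunk])
            prev_tail = chunk.iloc[-overlap_size:]

# usage:
# for df in overlapping_chunks('data.csv'):
#     process(df)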
Upvotes: 1