Sid

Reputation: 4055

How to use chunksize with an offset in pandas?

Reading a large CSV file using pandas, I want to use chunksize to limit the number of rows read in at a time, but on each iteration after the first I would like to keep the last 300 rows from the previous chunk.

Is there a way to do this in read_csv?

Upvotes: 0

Views: 697

Answers (1)

Corralien

Reputation: 120469

Use read_csv with the chunksize=XXX parameter. On each iteration, save the last 300 rows for the next iteration and concatenate them with the new XXX rows:

import pandas as pd

chunk_size = 5    # 1000 for the real file
overlap_size = 3  # 300 for the real file

with pd.read_csv('data.csv', chunksize=chunk_size) as reader:
    data = []
    prev_chunk = pd.DataFrame()  # empty before the first iteration
    for i, chunk in enumerate(reader, 1):
        # Prepend the tail of the previous chunk to the new rows
        df = pd.concat([prev_chunk, chunk])
        # Keep the last `overlap_size` rows for the next iteration
        prev_chunk = chunk[-overlap_size:]
        # Do whatever you want with df
        # res = process_data(df)
        # data.append(res)
        print(f'Iteration {i}:\n{df}\n')

# df = pd.concat(data)  # uncomment once data.append(res) is active
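
For reference, the output below comes from a small data.csv with a single colA column holding line1 through line16. The file itself isn't shown in the answer, so this reconstruction is inferred from the printed frames:

import pandas as pd

# Recreate the sample input (inferred from the output; not part of the
# original answer)
pd.DataFrame({'colA': [f'line{i}' for i in range(1, 17)]}).to_csv('data.csv', index=False)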

Output:

Iteration 1:
    colA
0  line1
1  line2
2  line3
3  line4
4  line5

Iteration 2:
     colA
2   line3
3   line4
4   line5
5   line6
6   line7
7   line8
8   line9
9  line10

Iteration 3:
      colA
7    line8
8    line9
9   line10
10  line11
11  line12
12  line13
13  line14
14  line15

Iteration 4:
      colA
12  line13
13  line14
14  line15
15  line16

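Note that pd.concat keeps the original row labels, so the overlap rows appear in two consecutive df frames; if you append the frames themselves (or results indexed the same way) to data, the final concatenation will contain those rows twice. A minimal way to deduplicate, assuming the original index was kept, is:

df = pd.concat(data)
df = df[~df.index.duplicated(keep='first')]  # drop rows already seen in the previous chunk
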
Upvotes: 1