acepor

Reputation: 41

strange indexing mechanism of pandas.read_csv function with chunksize option

Due to the huge data size, we used pandas to process data, but a very strange phenomenon occurred. The pseudo code looks like this:

reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
for chunk in reader:
    result = []
    for line in chunk.values.tolist():
        temp = complicated_process(chunk)  # this involves very complicated processing, so this is just a simplified version
        result.append(temp)
    chunk['new_series'] = pd.Series(result)
    chunk.to_csv(OUT_TILE, index=False, mode='a')

We can confirm that result is non-empty in every iteration of the loop. However, only on the first iteration does the line chunk['new_series'] = pd.Series(result) actually populate the column; on every later iteration the new column comes out empty. As a result, only the first chunk in the output file contains new_series; the rest are empty.

Did we miss anything here? Thanks in advance.

Upvotes: 2

Views: 545

Answers (2)

Alexander

Reputation: 109626

You should declare result above your loop, otherwise you are just re-initializing it with each chunk.

result = []
for chunk in reader:
    ...

Your previous method is functionally equivalent to:

for chunk in reader:
    del result  # because it is being re-assigned on the following line.
    result = []
    result.append(something)
print(result)  # Only shows result from last chunk in reader (the last loop).
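To illustrate the point, here is a minimal runnable sketch of accumulating results across chunks with result declared once before the loop (the input data and the `* 10` processing step are stand-ins for the real file and the real complicated_process):

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for the huge input file.
csv_data = io.StringIO("x\n1\n2\n3\n4\n")
reader = pd.read_csv(csv_data, chunksize=2)

result = []  # declared once, before the loop
for chunk in reader:
    for value in chunk['x'].tolist():
        result.append(value * 10)  # stand-in for the real processing

print(result)  # values from all chunks survive the loop: [10, 20, 30, 40]
```

Had result been re-initialized inside the loop, only the last chunk's values would remain at the end.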

Also, I would recommend:

chunk = chunk.assign(new_series=result)  # Instead of `chunk['new_series'] = pd.Series(result)`.

I am assuming you are doing something with the line variable in your for loop, even though it is not used in your example above.
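A small sketch of why `assign` helps here: assigning a plain list is positional, so it fills the column even when the chunk's index does not start at 0 (note that `assign` returns a new DataFrame, which is why the result must be re-bound):

```python
import pandas as pd

# Simulate a later chunk, whose index continues from the whole file.
chunk = pd.DataFrame({'x': [1, 2, 3]}, index=[1000, 1001, 1002])
result = [10, 20, 30]

# A plain list is assigned by position, not by index alignment.
chunk = chunk.assign(new_series=result)
print(chunk['new_series'].tolist())  # [10, 20, 30]
```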

Upvotes: 3

acepor

Reputation: 41

A better solution would be this:

reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
for chunk in reader:
    result = []
    for line in chunk.values.tolist():
        temp = complicated_process(chunk)  # this involves very complicated processing, so this is just a simplified version
        result.append(temp)
    new_chunk = chunk.reset_index()
    new_chunk = new_chunk.assign(new_series=result)
    new_chunk.to_csv(OUT_TILE, index=False, mode='a')

Notice: the index of each chunk is not independent; it is derived from the whole file. So when we assign a new series inside each loop, the chunk keeps the index it inherited from the whole file, and the index of the chunk no longer matches the 0-based index of the new series.
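The alignment problem can be reproduced in a few lines: a Series built from a list gets a fresh 0-based index, which overlaps with nothing in any chunk after the first, so assignment by alignment yields only NaN (a small sketch; unlike the answer's code, it uses reset_index(drop=True) to discard the old index instead of keeping it as a column):

```python
import pandas as pd

# Simulate the second chunk: its index continues from the whole file.
chunk = pd.DataFrame({'x': [4, 5, 6]}, index=[1000, 1001, 1002])

# pd.Series([...]) gets index 0..2, which matches no label in the chunk,
# so index-aligned assignment produces an all-NaN column.
chunk['bad'] = pd.Series([40, 50, 60])
print(chunk['bad'].isna().all())   # True: the column is empty

# Resetting the chunk's index first makes the two indexes line up.
fixed = chunk.reset_index(drop=True)
fixed['good'] = pd.Series([40, 50, 60])
print(fixed['good'].tolist())      # [40, 50, 60]
```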

The solution by @Alexander works, but result keeps growing across the whole file, so it can occupy too much memory.

The new solution here resets the index of each chunk with new_chunk = chunk.reset_index(), and result is re-initialized within each loop. This saves a lot of memory.

Upvotes: 1
