Reputation: 41
Due to the huge data size, we used pandas to process the data, but a very strange phenomenon occurred. The pseudo-code looks like this:
    reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
    for chunk in reader:
        result = []
        for line in chunk.itertuples():
            temp = complicated_process(chunk)  # this involves very complicated processing, so this is just a simplified version
            result.append(temp)
        chunk['new_series'] = pd.Series(result)
        chunk.to_csv(OUT_TILE, index=False, mode='a')
We can confirm that result is non-empty in every iteration of the loop. But only on the first iteration does the line chunk['new_series'] = pd.Series(result) actually fill in values; on all later iterations the new column is empty. Therefore only the first chunk in the output contains new_series; the rest are empty.
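Here is a minimal standalone reproduction of the symptom, using two made-up two-row chunks (the column name x and the values are hypothetical stand-ins for the real data):

```python
import pandas as pd

# Simulate two chunks of a larger file; the second chunk keeps the
# file-wide index 2..3, just as read_csv(..., chunksize=...) does.
chunk1 = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
chunk2 = pd.DataFrame({'x': [3, 4]}, index=[2, 3])

chunk1['new_series'] = pd.Series([10, 20])  # Series index 0..1 matches chunk1's index
chunk2['new_series'] = pd.Series([30, 40])  # Series index 0..1 does NOT match chunk2's index 2..3

print(chunk1['new_series'].tolist())    # [10, 20]
print(chunk2['new_series'].isna().all())  # True -- the whole column is NaN
```

Assignment of a Series to a DataFrame column aligns on index labels, which is why the second chunk's column comes out empty.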
Did we miss anything here? Thanks in advance.
Upvotes: 2
Views: 545
Reputation: 109626
You should declare result above your loop; otherwise you are just re-initializing it with each chunk.
    result = []
    for chunk in reader:
        ...
Your previous method is functionally equivalent to:
    for chunk in reader:
        del result  # because it is being re-assigned on the following line
        result = []
        result.append(something)
    print(result)  # only shows result from the last chunk in reader (the last loop)
Also, I would recommend:
    chunk = chunk.assign(new_series=result)  # instead of `chunk['new_series'] = pd.Series(result)`
I am assuming you are doing something with the line variable in your for loop, even though it is not used in your example above.
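A minimal sketch of the accumulate-across-chunks pattern described above, with a tiny in-memory CSV and a trivial complicated_process standing in for the real file and processing:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for the question's IN_FILE and complicated_process.
csv_data = StringIO("a\n1\n2\n3\n4\n")

def complicated_process(row):
    return row.a * 10

result = []  # declared once, ABOVE the loop, so it survives across chunks
reader = pd.read_csv(csv_data, chunksize=2)
for chunk in reader:
    for row in chunk.itertuples():
        result.append(complicated_process(row))

print(result)  # [10, 20, 30, 40] -- results from every chunk, not just the last
```

Note the trade-off: accumulating into one list keeps all results in memory at once, which matters for large files.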
Upvotes: 3
Reputation: 41
A better solution would be this:
    reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
    for chunk in reader:
        result = []
        for line in chunk.itertuples():
            temp = complicated_process(chunk)  # this involves very complicated processing, so this is just a simplified version
            result.append(temp)
        new_chunk = chunk.reset_index()
        new_chunk = new_chunk.assign(new_series=result)
        new_chunk.to_csv(OUT_TILE, index=False, mode='a')
Notice: the index of each chunk is not independent; it is carried over from the whole file. If we assign a new series built inside each loop, the chunk inherits its index from the whole file, so the index of the chunk and the index of the new series do not match.
The solution by @Alexander works, but result might become huge and occupy too much memory. The solution here resets the index of each chunk with new_chunk = chunk.reset_index(), and result is re-initialized within each loop, which saves a lot of memory.
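An end-to-end sketch of this per-chunk reset_index approach, using a tiny in-memory CSV and a trivial complicated_process as hypothetical stand-ins; it passes drop=True (an assumption, not in the original) so the old file-wide index is not written out as an extra column:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for IN_FILE and complicated_process.
csv_data = StringIO("a\n1\n2\n3\n4\n")

def complicated_process(row):
    return row.a * 10

out_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # result is re-created each iteration, so only one chunk's worth is in memory
    result = [complicated_process(row) for row in chunk.itertuples()]
    new_chunk = chunk.reset_index(drop=True)       # per-chunk index 0..n-1 now matches result
    new_chunk = new_chunk.assign(new_series=result)
    out_chunks.append(new_chunk)                   # the original appends to a CSV instead

combined = pd.concat(out_chunks, ignore_index=True)
print(combined['new_series'].tolist())  # [10, 20, 30, 40] -- every chunk got its values
```

Collecting into out_chunks here is only for demonstration; in the original, each new_chunk is appended directly to the output file with to_csv(..., mode='a').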
Upvotes: 1