Reputation: 343
I am using (Python's) pandas map function to process a big CSV file (~50 gigabytes), like this:
import pandas as pd
df = pd.read_csv("huge_file.csv")
df["results1"], df["results2"] = df.map(foo)
df.to_csv("output.csv")
Is there a way I can use parallelization on this? Perhaps using multiprocessing's map function?
Thanks, Jose
Upvotes: 0
Views: 1771
Reputation: 128948
See the pandas docs on reading a CSV by chunks (the chunksize argument to read_csv), the accompanying example, and the docs on appending to an HDF store.
You are much better off reading your CSV in chunks, processing each chunk, and then writing the results out to a CSV (of course, you'd be even better off converting to HDF).
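For illustration, a minimal sketch of that chunked pattern; the chunksize value, the foo placeholder, and the column names are assumptions carried over from the question, not part of the answer:

import pandas as pd

def foo(row):
    # placeholder: stand-in for the question's real per-row function,
    # assumed here to return two values per row
    return row.iloc[0], row.iloc[-1]

chunksize = 1_000_000  # rows per chunk; tune to the memory you have

# delete any existing output.csv first, since each chunk is appended to it
for i, chunk in enumerate(pd.read_csv("huge_file.csv", chunksize=chunksize)):
    # process one manageable piece at a time instead of the whole 50 GB frame
    results = chunk.apply(foo, axis=1, result_type="expand")
    chunk["results1"] = results[0]
    chunk["results2"] = results[1]
    # append each processed chunk to the output; write the header only once
    chunk.to_csv("output.csv", mode="a", header=(i == 0), index=False)

Each chunk would also be a natural unit of work for multiprocessing.Pool.map if you still want parallelism, and swapping to_csv for to_hdf (or HDFStore.append) is one way to take the HDF route mentioned above; both are possibilities beyond what this answer spells out.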
Upvotes: 2