Jose G

Reputation: 343

Using multiprocessing map with a pandas dataframe?

I am using pandas' map function in Python to process a big CSV file (~50 GB), like this:

import pandas as pd

df = pd.read_csv("huge_file.csv")
df["results1"], df["results2"] = df.map(foo)
df.to_csv("output.csv")

Is there a way I can use parallelization on this? Perhaps using multiprocessing's map function?

Thanks, Jose

Upvotes: 0

Views: 1771

Answers (1)

Jeff

Reputation: 128948

See the pandas docs on reading a CSV by chunks (with an example) and on appending to an HDFStore.

You are much better off reading your CSV in chunks, processing each chunk, and then writing the results out to a CSV (though you are even better off converting to HDF).

  • Takes a relatively constant amount of memory
  • Efficient, and can be done in parallel (though this usually requires an HDF file that you can select sections from; a CSV is not good for this); see the sketches after this list
  • Less complicated than trying to do multiprocessing directly
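A minimal sketch of the chunked approach (the chunksize, the input column name "data", and a foo that returns a two-item tuple are all assumptions mirroring the question's code):

import pandas as pd

# Read the 50 GB file in fixed-size pieces; memory use stays roughly constant.
reader = pd.read_csv("huge_file.csv", chunksize=100_000)

first = True
for chunk in reader:
    # "data" is a hypothetical input column; foo is assumed to
    # return a (result1, result2) tuple per value.
    chunk["results1"], chunk["results2"] = zip(*chunk["data"].map(foo))
    # Append each processed chunk, writing the header only once.
    chunk.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
    first = False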

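And since the question asks about multiprocessing's map: true parallel reads want a seekable store like HDF, as the second bullet says, but you can still parallelize the per-chunk computation over a CSV. A sketch under the same assumptions (foo and process_chunk must live at module top level so they can be pickled):

import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # Same assumptions as above: hypothetical "data" column,
    # foo returning a (result1, result2) tuple per value.
    chunk["results1"], chunk["results2"] = zip(*chunk["data"].map(foo))
    return chunk

if __name__ == "__main__":
    reader = pd.read_csv("huge_file.csv", chunksize=100_000)
    with Pool() as pool:
        first = True
        # imap keeps the chunks in order while workers process them in parallel.
        for result in pool.imap(process_chunk, reader):
            result.to_csv("output.csv", mode="w" if first else "a", header=first, index=False)
            first = False

Each chunk is pickled over to a worker process, so this only pays off when foo is expensive relative to the serialization cost.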
Upvotes: 2
