Vipin

Reputation: 5223

Pandas groupby + transform taking hours for 600 Million records

My DataFrame has 3 fields: account, month and salary.

account month              Salary
1       201501             10000
2       201506             20000
2       201506             20000
3       201508             30000
3       201508             30000
3       201506             10000
3       201506             10000
3       201506             10000
3       201506             10000

I am doing a groupby on account and month and calculating the sum of salary for each group, then removing the duplicates.

MyDataFrame['Salary'] = MyDataFrame.groupby(['account', 'month'])['Salary'].transform('sum')
MyDataFrame = MyDataFrame.drop_duplicates()

Expecting output like below:

account month              Salary
1       201501             10000
2       201506             40000
3       201508             60000
3       201506             40000

It works well for a few records. I tried the same for 600 million records and it has been in progress for 4-5 hours. Initially, when I loaded the data using pd.read_csv(), it took 60 GB of RAM; for the first 1-2 hours RAM usage stayed between 90 and 120 GB. After 3 hours the process is taking 236 GB of RAM and it is still running.

Please suggest if any faster alternative is available for this.

EDIT: Now 15 minutes in with df.groupby(['account', 'month'], sort=False)['Salary'].sum()

Upvotes: 2

Views: 803

Answers (3)

AZhao

Reputation: 14415

Might be worth downloading the development version of Pandas 0.17.0. They are releasing the GIL (the CPython lock that prevents threads from running Python code in parallel). It's going to be natively implemented in groupby, and this blog post suggested speed increases of 3x on a group-mean example.

http://continuum.io/blog/pandas-releasing-the-gil

http://pandas.pydata.org/

Upvotes: 2

dagrha

Reputation: 2559

Just to follow up on chrisb's answer and Alexander's comment, you indeed will get more performance out of the .sum() and .agg('sum') methods. Here's a Jupyter %%timeit output for the three:

[Image: %%timeit results for the three groupby methods]

So, the approaches that chrisb and Alexander suggest are about twice as fast on your very small example dataset.

Also, according to the Pandas API documentation, adding the kwarg sort=False will also help performance. So, your groupby should look something like df.groupby(['account', 'month'], sort=False)['Salary'].sum(). Indeed, when I ran it, it was about 10% faster than the runs shown in the above image.
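A minimal runnable sketch of the sort=False variant, using a DataFrame I rebuilt from the question's sample data (timings will of course vary by machine):

```python
import pandas as pd

# Sample data reconstructed from the question so the snippet runs standalone
df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# sort=False skips sorting the group keys; groups come back in
# first-appearance order, which saves work on very large frames
result = df.groupby(['account', 'month'], sort=False)['Salary'].sum()
```

On this sample, first-appearance order happens to match the row order of the expected output in the question (account 3, month 201508 before 201506).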

Upvotes: 2

chrisb

Reputation: 52246

Unless I'm misunderstanding something, you're really doing an aggregation - transform is for when you need the data in the same shape as the original frame. This should be somewhat faster, and it does everything in one step.

df.groupby(['account', 'month'])['Salary'].agg('sum')
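On the question's sample data (rebuilt below so the snippet runs standalone), this one-liner plus a reset_index() - my addition, to turn the group keys back into ordinary columns - reproduces the expected totals:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# One-step aggregation: sum Salary per (account, month) group
agg = df.groupby(['account', 'month'])['Salary'].agg('sum')

# reset_index() converts the group keys from the index back into columns
result = agg.reset_index()
```

Note the transform + drop_duplicates version materializes a full-length intermediate column for all 600 million rows; the aggregation only ever holds one row per group.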

Upvotes: 2
