Reputation: 5223
My DataFrame has 3 fields: account, month, and Salary.
account month Salary
1 201501 10000
2 201506 20000
2 201506 20000
3 201508 30000
3 201508 30000
3 201506 10000
3 201506 10000
3 201506 10000
3 201506 10000
I am grouping by account and month, calculating the sum of Salary for each group, and then removing duplicates.
MyDataFrame['Salary'] = MyDataFrame.groupby(['account', 'month'])['Salary'].transform('sum')
MyDataFrame = MyDataFrame.drop_duplicates()
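For reference, here is a minimal, runnable sketch of this approach on the sample data above (column names as in the question; the groupby keys must be passed as a single list):

```python
import pandas as pd

df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# Replace each row's Salary with its group total, then collapse
# the now-identical rows into one per (account, month) group.
df['Salary'] = df.groupby(['account', 'month'])['Salary'].transform('sum')
df = df.drop_duplicates()
```

This produces the four expected rows, but note it makes two passes over the data (transform, then drop_duplicates), which matters at 600 million rows.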
Expecting output like below:
account month Salary
1 201501 10000
2 201506 40000
3 201508 60000
3 201506 40000
It works well for a few records, but I tried the same on 600 million records and it has been running for 4-5 hours. Initially, when I loaded the data using pd.read_csv(), it took 60 GB of RAM; for the first 1-2 hours RAM usage stayed between 90 and 120 GB. After 3 hours the process is taking 236 GB of RAM and it is still running.
Please suggest a faster alternative, if one is available.
EDIT: It now finishes in 15 minutes with df.groupby(['account', 'month'], sort=False)['Salary'].sum()
Upvotes: 2
Views: 803
Reputation: 14415
Might be worth trying the development version of Pandas 0.17.0. They are releasing the GIL, which limits multithreading. It's going to be natively implemented in groupby, and the blog post below reported speed increases of 3x on a group-mean example.
http://continuum.io/blog/pandas-releasing-the-gil
Upvotes: 2
Reputation: 2559
Just to follow up on chrisb's answer and Alexander's comment, you will indeed get more performance out of the .sum() and .agg('sum') methods. Here's Jupyter %%timeit output for the three approaches:

[timing screenshot omitted]

The approaches chrisb and Alexander suggest are about twice as fast on your very small example dataset.
Also, according to the Pandas API documentation, adding the kwarg sort=False will help performance. So your groupby should look something like df.groupby(['account', 'month'], sort=False)['Salary'].sum(). Indeed, when I ran it, it was about 10% faster than the runs timed above.
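To illustrate that the variants are interchangeable (a sketch on a small made-up sample; timings will of course differ on your 600M rows):

```python
import pandas as pd

df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000, 10000],
})

# All three produce the same per-group totals; sort=False skips
# sorting the group keys, which saves time on large inputs.
a = df.groupby(['account', 'month'])['Salary'].sum()
b = df.groupby(['account', 'month'])['Salary'].agg('sum')
c = df.groupby(['account', 'month'], sort=False)['Salary'].sum()
```

The only difference is the ordering of the result index, so sort the sort=False result before comparing.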
Upvotes: 2
Reputation: 52246
Unless I'm misunderstanding something, you're really doing an aggregation - transform is for when you need the data in the same shape as the original frame. This should be somewhat faster and does it all in one step:
df.groupby(['account', 'month'])['Salary'].agg('sum')
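Applied to the question's sample data, with a reset_index() tacked on to get back a flat frame like the expected output (a sketch; the one-liner above is the core of it):

```python
import pandas as pd

df = pd.DataFrame({
    'account': [1, 2, 2, 3, 3, 3, 3, 3, 3],
    'month':   [201501, 201506, 201506, 201508, 201508,
                201506, 201506, 201506, 201506],
    'Salary':  [10000, 20000, 20000, 30000, 30000,
                10000, 10000, 10000, 10000],
})

# Aggregate directly: one row per (account, month) group,
# no transform + drop_duplicates pass needed.
result = df.groupby(['account', 'month'])['Salary'].agg('sum').reset_index()
```

This yields the same four rows as the transform/drop_duplicates version in a single groupby pass.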
Upvotes: 2