Reputation: 8481
I have a dataframe that has columns country
, date
, and users
(i.e. number of users). I want to do a diff along the date
column, but re-start the calculation for each country
. How to do this?
Example data might look like this:
date country users
2015-03-01 US 3
2015-03-02 US 7
2015-03-03 US 9
2015-03-04 US 11
2015-03-01 FR 4
2015-03-02 FR 8
2015-03-03 FR 12
2015-03-04 FR 16
The diff should give this:
2015-03-01 US NaN
2015-03-02 US 4
2015-03-03 US 2
2015-03-04 US 2
2015-03-01 FR NaN
2015-03-02 FR 4
2015-03-03 FR 4
2015-03-04 FR 4
Upvotes: 1
Views: 416
Reputation: 353389
As mentioned in the comments, this is a groupby
problem (see here in the docs for the split-apply-combine pattern).
In your example, we want to group by the country column, then do a diff along the users column (you say along the date column, but that doesn't seem to match your expected output):
>>> df["new_diff"] = df.groupby("country")["users"].diff()
>>> df
date country users new_diff
0 2015-03-01 US 3 NaN
1 2015-03-02 US 7 4
2 2015-03-03 US 9 2
3 2015-03-04 US 11 2
4 2015-03-01 FR 4 NaN
5 2015-03-02 FR 8 4
6 2015-03-03 FR 12 4
7 2015-03-04 FR 16 4
Note that in a real problem you'd have to decide on what you want to do about missing days and so on (and I tend to throw in a sort
on columns like date
to ensure that things are in the order I think they are.)
Upvotes: 3