How to use Pandas diff() on DataFrame that has multiple groups?

Question

I have a dataframe that has columns country, date, and users (i.e. number of users). I want to do a diff along the date column, but re-start the calculation for each country. How to do this?

Example data might look like this:

date country users
2015-03-01 US 3
2015-03-02 US 7
2015-03-03 US 9
2015-03-04 US 11
2015-03-01 FR 4
2015-03-02 FR 8
2015-03-03 FR 12
2015-03-04 FR 16

The diff should give this:

2015-03-01 US NaN
2015-03-02 US 4
2015-03-03 US 2
2015-03-04 US 2
2015-03-01 FR NaN
2015-03-02 FR 4
2015-03-03 FR 4
2015-03-04 FR 4

DSM · Accepted Answer

As mentioned in the comments, this is a groupby problem (see here in the docs for the split-apply-combine pattern).

In your example, we want to group by the country column, then do a diff along the users column (you say along the date column, but that doesn't seem to match your expected output):

>>> df["new_diff"] = df.groupby("country")["users"].diff()
>>> df
         date country  users  new_diff
0  2015-03-01      US      3       NaN
1  2015-03-02      US      7         4
2  2015-03-03      US      9         2
3  2015-03-04      US     11         2
4  2015-03-01      FR      4       NaN
5  2015-03-02      FR      8         4
6  2015-03-03      FR     12         4
7  2015-03-04      FR     16         4

Note that in a real problem you'd have to decide on what you want to do about missing days and so on (and I tend to throw in a sort on columns like date to ensure that things are in the order I think they are.)

How to use Pandas diff() on DataFrame that has multiple groups?

Answers (1)

Related Questions