jwdink

Reputation: 4875

Pandas: Group-by and Aggregate Column 1 with Condition from Column 2

I'm trying to move from R & dplyr into Python and pandas for some projects, and I'm hoping to figure out how to replicate common coding strategies I used with dplyr.

One common pattern is that I'll group by a particular column, then compute a derived column that depends on a condition involving some third column. Here's a simple example:

library(dplyr)

dat = data.frame(user = rep(c("1",2,3,4), each=5),
           cancel_date = rep(c(12,5,10,11), each=5)
) %>%
  group_by(user) %>%
  mutate(login = sample(1:cancel_date[1], size = n(), replace = T)) %>%
  ungroup()


Source: local data frame [6 x 3]

  user cancel_date login
1    1          12     3
2    1          12     9
3    1          12    12
4    1          12     4
5    1          12     2
6    2           5     4

In this data frame, I'd like to calculate how many logins each user had at least three months before they cancelled. In dplyr, this is simple:

dat %>%
  group_by(user) %>%
  summarise(logins_three_mos_before_cancel = length(login[cancel_date-login>=3]))

  user logins_three_mos_before_cancel
1    1                              4
2    2                              1
3    3                              5
4    4                              3

But I'm a bit stumped at how to do this in pandas. As far as I can tell, aggregate only applies a function to a single grouped column, and I don't know how to get it to apply a function that involves multiple columns.

Here's that same data in pandas:

import numpy as np
import pandas as pd

d = {'user': np.repeat([1, 2, 3, 4], 5),
     'cancel_date': np.repeat([12, 5, 10, 11], 5),
     'login': np.array([3,  9, 12,  4,  2,  4,  3,  5,  5,  1,
                        3,  5,  4,  6,  3,  3,  5, 10,  7, 10])}
df = pd.DataFrame(data=d)
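The closest I've found is apply on the grouped frame, which hands each user's full sub-frame to a function, so a condition can reference both columns. Just sketching here; I'm not sure this is the idiomatic aggregate route:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': np.repeat([1, 2, 3, 4], 5),
    'cancel_date': np.repeat([12, 5, 10, 11], 5),
    'login': np.array([3, 9, 12, 4, 2, 4, 3, 5, 5, 1,
                       3, 5, 4, 6, 3, 3, 5, 10, 7, 10])})

# apply() receives each user's whole sub-frame, so the
# condition can use cancel_date and login together:
out = df.groupby('user').apply(
    lambda g: (g.cancel_date - g.login >= 3).sum())
# out is a Series indexed by user: 1 -> 4, 2 -> 1, 3 -> 5, 4 -> 3
```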

Upvotes: 0

Views: 600

Answers (2)

Panwen Wang

Reputation: 3835

It's pretty easy to translate your R code into Python with datar:

>>> from datar.all import (
...     f, c, tibble, rep, length, set_seed,
...     group_by, mutate, sample, n, ungroup, summarise, 
... )
>>> 
>>> set_seed(8525)
>>> 
>>> dat = tibble(
...     user=rep(c("1", 2, 3, 4), each=5),
...     cancel_date=rep(c(12, 5, 10, 11), each=5)
... ) >> group_by(
...     f.user
... ) >> mutate(
...     login=sample(f[1:f.cancel_date[0]], size=n(), replace=True)
... ) >> ungroup()
>>> 
>>> dat
       user  cancel_date   login
   <object>      <int64> <int64>
0         1           12       6
1         1           12      11
2         1           12       6
3         1           12       1
4         1           12       7
5         2            5       4
6         2            5       2
7         2            5       4
8         2            5       4
9         2            5       1
10        3           10       5
11        3           10       2
12        3           10       9
13        3           10      10
14        3           10       3
15        4           11      11
16        4           11       6
17        4           11      10
18        4           11       1
19        4           11       6
>>> dat >> group_by(
...     f.user
... ) >> summarise(
...     logins_three_mos_before_cancel = length(f.login[f.cancel_date-f.login>=3])
... )
      user  logins_three_mos_before_cancel
  <object>                         <int64>
0        1                               4
1        2                               2
2        3                               3
3        4                               3

Disclaimer: I am the author of the datar package.

Upvotes: 0

Ami Tavory

Reputation: 76346

I hope I followed your R, but do you mean this?

>> df[df.cancel_date - df.login >= 3].user.value_counts().sort_index()
1    4
2    1
3    5
4    3
dtype: int64
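Equivalently, filtering first and then grouping gives the same counts with a dplyr-style named result column; a sketch along the same lines:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user': np.repeat([1, 2, 3, 4], 5),
    'cancel_date': np.repeat([12, 5, 10, 11], 5),
    'login': np.array([3, 9, 12, 4, 2, 4, 3, 5, 5, 1,
                       3, 5, 4, 6, 3, 3, 5, 10, 7, 10])})

# Keep only rows meeting the condition, count rows per user,
# and name the result column as in the dplyr summarise:
res = (df[df.cancel_date - df.login >= 3]
       .groupby('user')
       .size()
       .rename('logins_three_mos_before_cancel')
       .reset_index())
```

Note that `size()` only reports users who still have rows after the filter, so a user with zero qualifying logins would be dropped here (as with `value_counts()` above).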

Upvotes: 2
