Reputation: 661
I am fairly familiar with R dplyr for data analysis and I am trying to convert some of the code I have written in dplyr to pandas. I have data that has a person identified by the ID column and DATE on which that person used a certain PRODUCT. I am trying to figure out the best way to convert the following R dplyr code to python pandas code. Basically, I group by the ID column, filter for a certain type of product, and then add a column that (for each row in the group) adds a minimum (first used) date and a maximum (last used) date for that person and product. Finally, I also add a column that calculates the number of days between the last used and first used date. Here is the data:
ID PRODUCT DATE
A ITEM1 1/30/15
B ITEM1 2/23/14
A ITEM2 3/22/15
C ITEM1 1/23/12
B ITEM1 4/12/15
A ITEM3 2/2/14
C ITEM1 1/1/17
A ITEM1 2/20/15
A ITEM1 5/18/15
using dplyr I can do
library(dplyr)
library(lubridate)
df <- df %>%
mutate(DATE = mdy(DATE)) %>%
group_by(ID) %>%
filter(PRODUCT == "Item1") %>%
mutate(FIRST = min(DATE), LAST = max(DATE), DAYS = LAST - FIRST)
which gives me
ID PRODUCT DATE FIRST LAST DAYS
(fctr) (fctr) (time) (time) (time) (dfft)
1 A ITEM1 2015-01-30 2015-01-30 2015-05-18 108 days
2 B ITEM1 2014-02-23 2014-02-23 2015-04-12 413 days
3 C ITEM1 2012-01-23 2012-01-23 2017-01-01 1805 days
4 B ITEM1 2015-04-12 2014-02-23 2015-04-12 413 days
5 C ITEM1 2017-01-01 2012-01-23 2017-01-01 1805 days
6 A ITEM1 2015-02-20 2015-01-30 2015-05-18 108 days
7 A ITEM1 2015-05-18 2015-01-30 2015-05-18 108 days
Data:
df <- structure(list(ID = structure(c(1L, 2L, 1L, 3L, 2L, 1L, 3L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
PRODUCT = structure(c(1L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 1L), .Label = c("ITEM1", "ITEM2", "ITEM3"), class = "factor"),
DATE = structure(c(3L, 6L, 7L, 2L, 8L, 4L, 1L, 5L, 9L),
.Label = c("1/1/17", "1/23/12", "1/30/15", "2/2/14", "2/20/15", "2/23/14", "3/22/15", "4/12/15", "5/18/15"),
class = "factor")),
.Names = c("ID", "PRODUCT", "DATE"), class = "data.frame", row.names = c(NA, -9L))
How can I do the same in pandas?
Upvotes: 5
Views: 2522
Reputation: 28709
Another option, using the transform
function, in conjunction with assign
:
(df.loc[df.PRODUCT == 'ITEM1']
.assign(first = lambda df: df.groupby('ID').DATE.transform('min'),
last = lambda df: df.groupby('ID').DATE.transform('max'),
days = lambda df: df['last'] - df['first'])
)
ID PRODUCT DATE first last days
0 A ITEM1 2015-01-30 2015-01-30 2015-05-18 108 days
1 B ITEM1 2014-02-23 2014-02-23 2015-04-12 413 days
3 C ITEM1 2012-01-23 2012-01-23 2017-01-01 1805 days
4 B ITEM1 2015-04-12 2014-02-23 2015-04-12 413 days
6 C ITEM1 2017-01-01 2012-01-23 2017-01-01 1805 days
7 A ITEM1 2015-02-20 2015-01-30 2015-05-18 108 days
8 A ITEM1 2015-05-18 2015-01-30 2015-05-18 108 days
Upvotes: 0
Reputation: 3835
It's pretty easy to translate your R code into python with datar
:
>>> from datar.all import f, tribble, as_date, group_by, mutate, min, max, filter
[2021-06-24 13:44:46][datar][WARNING] Builtin name "min" has been overriden by datar.
[2021-06-24 13:44:46][datar][WARNING] Builtin name "max" has been overriden by datar.
[2021-06-24 13:44:46][datar][WARNING] Builtin name "filter" has been overriden by datar.
>>>
>>> df = tribble(
... f.ID, f.PRODUCT, f.DATE,
... "A", "ITEM1", "1/30/15",
... "B", "ITEM1", "2/23/14",
... "A", "ITEM2", "3/22/15",
... "C", "ITEM1", "1/23/12",
... "B", "ITEM1", "4/12/15",
... "A", "ITEM3", "2/2/14",
... "C", "ITEM1", "1/1/17",
... "A", "ITEM1", "2/20/15",
... "A", "ITEM1", "5/18/15",
... )
>>> df
ID PRODUCT DATE
<object> <object> <object>
0 A ITEM1 1/30/15
1 B ITEM1 2/23/14
2 A ITEM2 3/22/15
3 C ITEM1 1/23/12
4 B ITEM1 4/12/15
5 A ITEM3 2/2/14
6 C ITEM1 1/1/17
7 A ITEM1 2/20/15
8 A ITEM1 5/18/15
>>> df >> mutate(
... DATE=as_date(f.DATE, "%m/%d/%y")
... ) >> group_by(
... f.ID
... ) >> filter(
... f.PRODUCT == "ITEM1"
... ) >> mutate(
... FIRST=min(f.DATE),
... LAST=max(f.DATE),
... DAYS=f.LAST - f.FIRST
... )
ID PRODUCT DATE FIRST LAST DAYS
<object> <object> <object> <object> <object> <timedelta64[ns]>
0 A ITEM1 2015-01-30 2015-01-30 2015-05-18 108 days
1 B ITEM1 2014-02-23 2014-02-23 2015-04-12 413 days
2 C ITEM1 2012-01-23 2012-01-23 2017-01-01 1805 days
3 B ITEM1 2015-04-12 2014-02-23 2015-04-12 413 days
4 C ITEM1 2017-01-01 2012-01-23 2017-01-01 1805 days
5 A ITEM1 2015-02-20 2015-01-30 2015-05-18 108 days
6 A ITEM1 2015-05-18 2015-01-30 2015-05-18 108 days
[Groups: ID (n=3)]
Disclaimer: I am the author of the datar
package.
Upvotes: 0
Reputation: 294488
using agg
+ groupby
funcs = dict(FIRST='min', LAST='max', DAYS=np.ptp)
d1 = df.join(df.groupby(['ID', 'PRODUCT']).DATE.agg(funcs), on=['ID', 'PRODUCT'])
Upvotes: 5