Elvis

Reputation: 425

How to add another condition when using groupby.nth in python?

I have a DataFrame as below:

[image: screenshot of the DataFrame]

Each member has a unique Member Code. My aim is to extract the 1st and 5th visits of every day of every member into a new data frame without using a for loop. Any ideas on how I should do that?

Part of the sample data on 2018-09-02 is: [image: screenshot of the sample data]

I tried df.groupby(df["DateTime"].dt.date).nth([0,4]), but the output is:

[image: screenshot of the output]

As you can see, the third and fifth rows are from the same member. Since the member is the same, how can I take the next member instead, which is row number 153 with member code 17753?

Here is the small dataframe as an example:

import pandas as pd

DateTime = ["2018-09-01 00:01:00","2018-09-01 00:35:00","2018-09-01 01:14:00","2018-09-01 01:39:00","2018-09-01 01:46:00",
            "2018-09-02 00:13:00","2018-09-02 00:51:00","2018-09-02 01:23:00","2018-09-02 02:06:00","2018-09-02 02:16:00"]
Duration = [3.0,14.0,9.0,15.0,7.5,1.0,2.0,8.0,3.0,10.0]
Member_Code = [13595,17630,13522,17630,14942,17022,13596,10785,13596,14942]
data = {"DateTime":DateTime, "Duration":Duration, "Member Code":Member_Code}
df1 = pd.DataFrame(data)

# The format string must match the data: dashes in the date, and seconds in the time
df1["DateTime"] = pd.to_datetime(df1["DateTime"], format = "%Y-%m-%d %H:%M:%S")

df1

So based on this example, I want to extract the 2nd and the 4th visits. The expected output should be a dataframe consisting of the rows with indexes 1, 3, 6 and 9 (since the member in the row with index 8 is the same as in the row with index 6).

EDITED:

DataFrame:

DateTime = ["2018-09-01 00:01:00","2018-09-01 00:35:00","2018-09-01 01:14:00","2018-09-01 01:39:00","2018-09-01 01:46:00",
            "2018-09-02 00:13:00","2018-09-02 00:51:00","2018-09-02 01:23:00","2018-09-02 02:06:00","2018-09-02 02:16:00"]
Duration = [3.0,14.0,9.0,15.0,7.5,1.0,2.0,8.0,3.0,10.0]
Member_Code = [13595,17630,13522,17630,14942,17022,13596,13596,13596,14942]
data = {"DateTime":DateTime, "Duration":Duration, "Member Code":Member_Code}
df1 = pd.DataFrame(data)

df1["DateTime"] = pd.to_datetime(df1["DateTime"], format = "%Y-%m-%d %H:%M:%S")

df1

Solution:

vals = [1, 2]
df_new = df1.groupby(df1["DateTime"].dt.date).nth(vals)
dates = df1["DateTime"].dt.date
mask = df1['Member Code'].eq(dates.map(df1['Member Code'].groupby(dates).nth(vals[1])))
mask1 = df1['Member Code'].groupby(dates).cumcount().ne(vals[0])
df = df1.mask(mask & mask1).groupby(dates).bfill()
df_new = df.groupby(df1["DateTime"].dt.date).nth(vals)
df_new

Answer:

              DateTime            DateTime  Duration  Member Code
DateTime                                                         
2018-09-01  2018-09-01 2018-09-01 00:35:00      14.0      17630.0
2018-09-01  2018-09-01 2018-09-01 01:39:00      15.0      17630.0
2018-09-02  2018-09-02 2018-09-02 00:51:00       2.0      13596.0
2018-09-02  2018-09-02 2018-09-02 02:16:00      10.0      14942.0

Upvotes: 1

Views: 252

Answers (1)

jezrael

Reputation: 862771

Use:

import pandas as pd

DateTime = ["2018-09-01 00:01:00","2018-09-01 00:35:00","2018-09-01 01:14:00","2018-09-01 01:39:00","2018-09-01 01:46:00",
            "2018-09-02 00:13:00","2018-09-02 00:51:00","2018-09-02 01:23:00","2018-09-02 02:06:00","2018-09-02 02:16:00"]
Duration = [3.0,14.0,9.0,15.0,7.5,1.0,2.0,8.0,3.0,10.0]
Member_Code = [13595,17630,13522,17630,14942,17022,13596,13596,13596,14942]
data = {"DateTime":DateTime, "Duration":Duration, "Member Code":Member_Code}
df1 = pd.DataFrame(data)

# The format string must match the data: dashes in the date, and seconds in the time
df1["DateTime"] = pd.to_datetime(df1["DateTime"], format = "%Y-%m-%d %H:%M:%S")
print (df1)
             DateTime  Duration  Member Code
0 2018-09-01 00:01:00       3.0        13595
1 2018-09-01 00:35:00      14.0        17630
2 2018-09-01 01:14:00       9.0        13522
3 2018-09-01 01:39:00      15.0        17630
4 2018-09-01 01:46:00       7.5        14942
5 2018-09-02 00:13:00       1.0        17022
6 2018-09-02 00:51:00       2.0        13596
7 2018-09-02 01:23:00       8.0        13596
8 2018-09-02 02:06:00       3.0        13596
9 2018-09-02 02:16:00      10.0        14942

First, compare every row's Member Code with the Member Code of the first nth row in its group to build a mask. Then exclude that first nth row itself by testing a per-group counter from cumcount. Pass the combined mask to DataFrame.mask to blank out the repeated rows, back fill the missing values per group, and finally apply the original groupby.nth solution:

vals = [1, 2]
dates = df1["DateTime"].dt.date
mask = df1['Member Code'].eq(dates.map(df1['Member Code'].groupby(dates).nth(vals[0])))
mask1 = df1['Member Code'].groupby(dates).cumcount().ne(vals[0])
df = df1.mask(mask & mask1).groupby(dates).bfill()
print (df)
             DateTime  Duration  Member Code
0 2018-09-01 00:01:00       3.0      13595.0
1 2018-09-01 00:35:00      14.0      17630.0
2 2018-09-01 01:14:00       9.0      13522.0
3 2018-09-01 01:46:00       7.5      14942.0
4 2018-09-01 01:46:00       7.5      14942.0
5 2018-09-02 00:13:00       1.0      17022.0
6 2018-09-02 00:51:00       2.0      13596.0
7 2018-09-02 02:16:00      10.0      14942.0
8 2018-09-02 02:16:00      10.0      14942.0
9 2018-09-02 02:16:00      10.0      14942.0

df_new = df.groupby(df1["DateTime"].dt.date).nth(vals)
print (df_new)
                      DateTime  Duration  Member Code
DateTime                                             
2018-09-01 2018-09-01 00:35:00      14.0      17630.0
2018-09-01 2018-09-01 01:14:00       9.0      13522.0
2018-09-02 2018-09-02 00:51:00       2.0      13596.0
2018-09-02 2018-09-02 02:16:00      10.0      14942.0
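
The same steps can also be sketched without groupby.nth. This may help on pandas 2.0 and later, where nth keeps the original row index instead of the group key, so the dates.map(...) lookup above may no longer line up. The names pos, ref_member, repeat and filled are my own and not part of the code above; treat this as a sketch under those assumptions rather than a drop-in replacement.

```python
import pandas as pd

# Same sample data as in the answer above.
df1 = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2018-09-01 00:01:00", "2018-09-01 00:35:00", "2018-09-01 01:14:00",
        "2018-09-01 01:39:00", "2018-09-01 01:46:00", "2018-09-02 00:13:00",
        "2018-09-02 00:51:00", "2018-09-02 01:23:00", "2018-09-02 02:06:00",
        "2018-09-02 02:16:00",
    ]),
    "Duration": [3.0, 14.0, 9.0, 15.0, 7.5, 1.0, 2.0, 8.0, 3.0, 10.0],
    "Member Code": [13595, 17630, 13522, 17630, 14942,
                    17022, 13596, 13596, 13596, 14942],
})

vals = [1, 2]
dates = df1["DateTime"].dt.date

# 0-based position of each visit within its day (hypothetical helper name).
pos = df1.groupby(dates).cumcount()

# Member Code of the first requested visit, broadcast to every row of its day.
ref_member = (
    df1["Member Code"].where(pos.eq(vals[0])).groupby(dates).transform("first")
)

# Rows by that same member, excluding the reference visit itself.
repeat = df1["Member Code"].eq(ref_member) & pos.ne(vals[0])

# Blank those rows, then back fill within each day from the next visit.
filled = df1.mask(repeat).groupby(dates).bfill()

# Keep the requested positions; this stands in for groupby.nth(vals).
df_new = filled[pos.isin(vals)]
print(df_new)
```

The result keeps the original row labels (1, 2, 6, 7), with the row at label 7 holding the back-filled values of the visit by member 14942. If a masked row has no later visit on the same day, the back fill leaves it as NaN, so a final dropna() may be needed on real data.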

Upvotes: 1
