Dervin Thunk
Dervin Thunk

Reputation: 20119

Number of distinct days in pandas

I have the following schema:

            H   T     datetime
    002599f58   A   2019-09-23
    002599f58   A   2019-09-23
    002599f58   A   2019-09-23
    002599f58   A   2019-09-26
    002599f58   C   2019-09-26
    005344fe8   C   2019-09-30
    005344fe8   D   2019-10-02
    005344fe8   D   2019-10-05

What I need to do is count the number of distinct days that H is in T. I've tried:

grouper = Grouper(key="datetime", freq="1D")
df = cdr.groupby(["H", "T", grouper])["datetime"].nunique().reset_index(name="numdays")

and others of the type, like size(). I now I have 14 days, but when I sort the values, they give me numbers greater than 14, so I know those are wrong, but I cannot tell why.

The expected output is a dataframe

H           T   numdays
002599f58   A   2 (23 & 26)
002599f58   C   1 (26)
005344fe8   C   1 (30)
005344fe8   D   2 (02 & 05) 

which tells me, for each different H and T, the number of distinct days that H is in T, where numdays cannot be larger than 14 (since that's all days I have). I don't need the data in parens, just there to show where the values come from.

Upvotes: 2

Views: 404

Answers (1)

jezrael
jezrael

Reputation: 862511

I believe you need to remove Grouper and for dates add Series.dt.date:

cdr['datetime'] = pd.to_datetime(cdr['datetime'])

cdr['dates'] = cdr['datetime'].dt.date
df = cdr.groupby(["H", "T"])["dates"].nunique().reset_index(name="numdays")
print (df)

           H  T  numdays
0  002599f58  A        2
1  002599f58  C        1
2  005344fe8  C        1
3  005344fe8  D        2

You can test solutions, old and new with convert output to lists:

print (cdr)
           H  T             datetime
0  002599f58  A  2019-09-23 12:45:10
1  002599f58  A  2019-09-23 02:45:10
2  002599f58  A  2019-09-23 12:45:50
3  002599f58  A  2019-09-26 12:25:10
4  002599f58  C  2019-09-26 10:45:10
5  005344fe8  C  2019-09-30 12:25:10
6  005344fe8  D  2019-10-02 12:45:00
7  005344fe8  D           2019-10-05

cdr['dates'] = cdr['datetime'].dt.date
print (cdr.groupby(["H", "T"])["dates"].apply(list))
H          T
002599f58  A    [2019-09-23, 2019-09-23, 2019-09-23, 2019-09-26]
           C                                        [2019-09-26]
005344fe8  C                                        [2019-09-30]
           D                            [2019-10-02, 2019-10-05]
Name: dates, dtype: object

grouper = pd.Grouper(key="datetime", freq="1D")
print (cdr.groupby(["H", "T", grouper])["datetime"].apply(list))
H          T  datetime  
002599f58  A  2019-09-23    [2019-09-23 12:45:10, 2019-09-23 02:45:10, 201...
              2019-09-26                                [2019-09-26 12:25:10]
           C  2019-09-26                                [2019-09-26 10:45:10]
005344fe8  C  2019-09-30                                [2019-09-30 12:25:10]
           D  2019-10-02                                [2019-10-02 12:45:00]
              2019-10-05                                [2019-10-05 00:00:00]
Name: datetime, dtype: object

Upvotes: 3

Related Questions