HRDSL

Reputation: 761

Count the number of rows in subgroups in group by object

I'm stuck on a pandas problem. I have a DataFrame which contains the following information:

Timestamp     User_id     page_id

01-01-2005     1             2
01-01-2005     1             3
01-01-2005     1             4
01-02-2006     2             3
01-03-2005     3             2
01-03-2005     3             1
01-04-2006     2             3
01-04-2006     2             1
01-04-2006     1             5
01-04-2006     1             3
01-01-2005     4             2
01-05-2006     2             3
01-08-2005     5             2
01-07-2006     6             3
01-01-2005     1             3
01-02-2006     2             3
01-04-2005     7             2
01-09-2006     2             3
01-10-2005     1             2
01-12-2006     3             3
01-01-2005     4             2
01-02-2006     5             3
01-01-2005     6             2
01-02-2006     2             1
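
For reference, the sample above can be rebuilt as a runnable frame like this (assuming the timestamps are month-first MM-DD-YYYY strings that still need parsing):

import io

import pandas as pd

# whitespace-separated copy of the sample data above
raw = """\
Timestamp User_id page_id
01-01-2005 1 2
01-01-2005 1 3
01-01-2005 1 4
01-02-2006 2 3
01-03-2005 3 2
01-03-2005 3 1
01-04-2006 2 3
01-04-2006 2 1
01-04-2006 1 5
01-04-2006 1 3
01-01-2005 4 2
01-05-2006 2 3
01-08-2005 5 2
01-07-2006 6 3
01-01-2005 1 3
01-02-2006 2 3
01-04-2005 7 2
01-09-2006 2 3
01-10-2005 1 2
01-12-2006 3 3
01-01-2005 4 2
01-02-2006 5 3
01-01-2005 6 2
01-02-2006 2 1
"""

data = pd.read_csv(io.StringIO(raw), sep=r'\s+')
# assuming month-first dates; adjust the format if they are day-first
data['Timestamp'] = pd.to_datetime(data['Timestamp'], format='%m-%d-%Y')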

So I want to find out how many unique page_ids each user_id has edited in each month. This means I must group by timestamp and user_id, and then apply the following lambda expression to get rid of duplicate page_ids:

df = data.groupby([pd.Grouper(key='Timestamp', freq='MS'), 'User_id']).apply(lambda x: x.drop_duplicates('page_id'))

And after that step, I'm stuck on how to get a DataFrame that looks like the following:

Timestamp   User_id     page_id_count

01-01-2005     1          X

where X is the number of different page_ids that User_id = 1 edited in 01-01-2005. Any help would be great. Thank you very much in advance.
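
For completeness, one way the drop_duplicates route above could be finished is by counting the rows left in each group. A rough sketch, assuming Timestamp is already a datetime column:

# drop duplicate page_ids within each (month, user) group, then the
# number of remaining rows per group is the distinct page_id count
counts = (data.groupby([pd.Grouper(key='Timestamp', freq='MS'), 'User_id'])
              .apply(lambda x: x.drop_duplicates('page_id'))
              .groupby(level=['Timestamp', 'User_id'])
              .size()
              .reset_index(name='page_id_count'))

That said, the answer below reaches the same result more directly with nunique.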

UPDATE:

This issue has been solved by jezrael's answer below.

Now I need to get, for each page_id, how many different user_ids edited it in a month. The output would look like the following:

Timestamp      page_id      user_id_count

01-01-2005        2              Y

where Y is the number of different user_ids that edited page_id = 2 in 01-01-2005.

Thank you very much!

Upvotes: 3

Views: 494

Answers (1)

jezrael

Reputation: 862661

Use SeriesGroupBy.nunique with Series.reset_index:

#if necessary
#df['Timestamp'] = pd.to_datetime(df['Timestamp'])

df = (df.groupby([pd.Grouper(key='Timestamp', freq='MS'), 'User_id'])['page_id']
        .nunique()
        .reset_index(name='page_id_count'))
print(df)
    Timestamp  User_id  page_id_count
0  2005-01-01        1              3
1  2005-01-01        3              2
2  2005-01-01        4              1
3  2005-01-01        5              1
4  2005-01-01        6              1
5  2005-01-01        7              1
6  2006-01-01        1              2
7  2006-01-01        2              2
8  2006-01-01        3              1
9  2006-01-01        5              1
10 2006-01-01        6              1

EDIT:

For the updated question, just swap the roles of the columns:

df = (df.groupby([pd.Grouper(key='Timestamp', freq='MS'), 'page_id'])['User_id']
        .nunique()
        .reset_index(name='User_id_count'))
print(df)
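With the sample data above (dates parsed month-first), this should print something like:

    Timestamp  page_id  User_id_count
0  2005-01-01        1              1
1  2005-01-01        2              6
2  2005-01-01        3              1
3  2005-01-01        4              1
4  2006-01-01        1              1
5  2006-01-01        3              5
6  2006-01-01        5              1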

Upvotes: 2
