Reputation: 761
I'm in trouble with a pandas issue. I have a dataframe which contains the following information:
Timestamp User_id page_id
01-01-2005 1 2
01-01-2005 1 3
01-01-2005 1 4
01-02-2006 2 3
01-03-2005 3 2
01-03-2005 3 1
01-04-2006 2 3
01-04-2006 2 1
01-04-2006 1 5
01-04-2006 1 3
01-01-2005 4 2
01-05-2006 2 3
01-08-2005 5 2
01-07-2006 6 3
01-01-2005 1 3
01-02-2006 2 3
01-04-2005 7 2
01-09-2006 2 3
01-10-2005 1 2
01-12-2006 3 3
01-01-2005 4 2
01-02-2006 5 3
01-01-2005 6 2
01-02-2006 2 1
So I want to find out, how many unique page_ids has each user_id edited for each month. This means i must group by timestamp and user_id, and then apply the following lambda expression to get rid of duplicated page_ids:
df = data.groupby([pd.Grouper(key ='timestamp', freq='MS'),'user_id']).apply(lambda x: x.drop_duplicates('page_id'))
And after that step, I'm stuck on how to get a dataframe that looks like the following:
Timestamp User_id page_id_count
01-01-2005 1 X
with X = number
of different page_id
that User_id = 1
edited in 01-01-2005
Any help would be great. Thank you very much in advance.
UPDATE:
this issue has been solved by an answer below, given by Jezrael.
Now, I need to get, for each page_id, how many different user_ids edited it in a month. The ouput would look like the following:
Timestamp page_id user_id_count
01-01-2005 2 Y
Being Y
the number of different user_id
s that edited the page_id = 2
on the date 01-01-2005
.
Thank you very much!
Upvotes: 3
Views: 494
Reputation: 862661
Use SeriesGroupBy.nunique
with Series.reset_index
:
#if necessary
#df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = (df.groupby([pd.Grouper(key ='Timestamp', freq='MS'),'User_id'])['page_id']
.nunique()
.reset_index(name='page_id_count'))
print (df)
Timestamp User_id page_id_count
0 2005-01-01 1 3
1 2005-01-01 3 2
2 2005-01-01 4 1
3 2005-01-01 5 1
4 2005-01-01 6 1
5 2005-01-01 7 1
6 2006-01-01 1 2
7 2006-01-01 2 2
8 2006-01-01 3 1
9 2006-01-01 5 1
10 2006-01-01 6 1
EDIT:
Swap values of columns only:
df = (df.groupby([pd.Grouper(key ='Timestamp', freq='MS'),'page_id'])['User_id']
.nunique()
.reset_index(name='User_id_count'))
print (df)
Upvotes: 2