helloB
helloB

Reputation: 3582

Pandas: calculate average number of rows per user per day

I have a Pandas DataFrame which has a DATE column and a userid column like so:

    userid      DATE
0   vDKE27PF6z  2016-05-26
1   z6XJewg04Y  2016-05-25
2   e5Fsyak59l  2016-05-25
3   7SaYpGRtMI  2016-05-25
4   KGmTH0XOMz  2016-05-25

There are many distinct DATE values and for each DATE a particular userid can appear twice. For each date, I'd like to calculate the number of rows that correspond to that DATE divided by the number of unique userid values that appear for a given DATE. What's the best way to do this? I've tried various permutations of groupby, but these don't go where I want to be.

Upvotes: 0

Views: 2327

Answers (4)

Sergey Bushmanov
Sergey Bushmanov

Reputation: 25199

Suppose you have

df
    userid  DATE
0   vDKE27PF6z  2016-05-26
1   z6XJewg04Y  2016-05-25
2   e5Fsyak59l  2016-05-25
3   7SaYpGRtMI  2016-05-25
4   KGmTH0XOMz  2016-05-25

Then you may proceed as follows:

df.groupby('DATE',as_index=False).agg(lambda x: 1/len(np.unique(x)))
          DATE  userid
0   2016-05-25  0.25
1   2016-05-26  1.00

Upvotes: 0

Nils Gudat
Nils Gudat

Reputation: 13800

You should be able to do something like this:

df = pd.DataFrame({"userid": ["vDKE27PF6z",  "vDKE27PF6z", "e5Fsyak59l", "7SaYpGRtMI", "KGmTH0XOMz"  ], "DATE" : ["2016-05-25","2016-05-25","2016-05-25","2016-05-25","2016-05-26"]})

dates = df.groupby("DATE").count() # Number of occurrences of each date
dates["uniques"] = df.groupby("DATE").userid.nunique() # Number of unique users per day
dates["ratio"] = dates.userid/dates.uniques # Ratio of the two

Upvotes: 4

grey_ranger
grey_ranger

Reputation: 1030

pd.groupby is the way to go. I think you're looking for the describe() function. It gives simple statistics about the DataFrame in question.

data = pd.DataFrame({'userid':['vDKE27PF6z', 'z6XJewg04Y', 
                               'e5Fsyak59l', '7SaYpGRtMI', 'KGmTH0XOMz'],
                     'DATE': ['2016-05-26', '2016-05-25', 
                              '2016-05-25', '2016-05-25', '2016-05-25']})

for ind, grp in data.groupby('DATE'):
    print(grp.describe())

The output is a pd.DataFrame object that has a count of the number of objects in that group, and the number of unique objects in that group. Sample output below:

>>>              DATE      userid
   count            4           4
   unique           1           4
   top     2016-05-25  7SaYpGRtMI
   freq             4           1

>>>              DATE      userid
   count            1           1
   unique           1           1
   top     2016-05-26  vDKE27PF6z
   freq             1           1

Notice for date '2016-05-25', there is one unique date (of course) and 4 unique user ids. More information about the describe() function can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

Upvotes: 0

Victor Chubukov
Victor Chubukov

Reputation: 1375

df.groupby('DATE')['userid'].agg(lambda x:x.nunique()*1.0/x.count())

in python3 would be fine without the *1.0

Upvotes: 2

Related Questions