Reputation: 1033
I have just moved from R to Python and have an issue regarding groupby. I have a dataframe with three features as shown below:
date Scaled Name
3 2018-10-01 02:00:00 14.57 19245
4 2018-10-01 02:00:00 11.90 7245
5 2018-10-01 02:00:00 15.84 25245
6 2018-10-01 03:00:00 16.98 25245
7 2018-10-01 03:00:00 11.40 7245
8 2018-10-01 03:00:00 16.95 19245
9 2018-10-01 04:00:00 17.78 25245
10 2018-10-01 04:00:00 12.06 7245
11 2018-10-01 04:00:00 18.19 19245
12 2018-10-01 05:00:00 19.63 25245
I have around 80 unique names in the dataset and hence duplicate dates. I would like to create a new column in the data set which is a percentage unique to each Name showing the proportion of hours for that particular Name compared to the total range of hours in the dataset. I can easily make this calculation, but I am struggling with generating the new column. The calculation would be something like this
hours = ((df['date'].max(axis=0) - df['date'].min(axis=0)).total_seconds())/3600
df['percentage'] = df['Name'].value_counts()/ hours
Upvotes: 1
Views: 54
Reputation: 862431
You are close, only add Series.map
:
df['percentage'] = df['Name'].map(df['Name'].value_counts())/ hours
Or use GroupBy.transform
with GroupBy.size
:
df['percentage'] = df.groupby('Name')['Name'].transform('size')/ hours
print (df)
date Scaled Name percentage
0 2018-10-01 02:00:00 14.57 19245 1.000000
1 2018-10-01 02:00:00 11.90 7245 1.000000
2 2018-10-01 02:00:00 15.84 25245 1.333333
3 2018-10-01 03:00:00 16.98 25245 1.333333
4 2018-10-01 03:00:00 11.40 7245 1.000000
5 2018-10-01 03:00:00 16.95 19245 1.000000
6 2018-10-01 04:00:00 17.78 25245 1.333333
7 2018-10-01 04:00:00 12.06 7245 1.000000
8 2018-10-01 04:00:00 18.19 19245 1.000000
9 2018-10-01 05:00:00 19.63 25245 1.333333
Upvotes: 5