Reputation: 359
I have a dataframe where the data looks like this.
pack pear carrot grape berry apple sum 3rd pack box
0 1 1.0 3.0 1.0 2.0 1.0 8 NaN 1.0
1 2 1.0 1.0 3.0 0.0 3.0 8 NaN 1.0
2 3 1.0 1.0 3.0 1.0 3.0 9 True 1.0
3 4 1.0 1.0 2.0 3.0 1.0 8 NaN 2.0
4 5 2.0 3.0 4.0 0.0 0.0 9 NaN 2.0
The 5 columns with the fruit names only have values between 0 and 6 and I'd like to have a data frame that shows the distribution of those individual values. It should like exactly like this
pear carrot grape berry apple
0.0 0.163636 0.145455 0.090909 0.218182 0.236364
1.0 0.472727 0.200000 0.218182 0.418182 0.454545
2.0 0.218182 0.309091 0.290909 0.254545 0.145455
3.0 0.090909 0.272727 0.236364 0.109091 0.127273
4.0 0.036364 0.072727 0.090909 NaN 0.036364
5.0 0.018182 NaN 0.036364 NaN NaN
6.0 NaN NaN 0.036364 NaN NaN
I did this by creating a series for each fruit and then merging the series together
pears_per_pack = df['pear'].value_counts(normalize=True)
carrots_per_pack = df['carrot'].value_counts(normalize=True)
grapes_per_pack = df['grape'].value_counts(normalize=True)
berrys_per_pack = df['berry'].value_counts(normalize=True)
apples_per_pack = df['apple'].value_counts(normalize=True)
df_lst = [pears_per_pack,carrots_per_pack,grapes_per_pack,berrys_per_pack,apples_per_pack]
fruit_df = pd.concat(df_lst, axis = 1)
Now, this was only 5 columns, fairly easy to do by hand, but this whole thing is more or less a learning opportunity for me, and I figure this must break DRY protocol. SO I'm asking if there is a better way to do something like this that is more proper/ would be more applicable if it were a greater number of columns I needed to do this with.
Upvotes: 2
Views: 171
Reputation: 18306
You can invoke pd.Series.value_counts
per column with apply
and then reindex
the result to 0..6
to assure every value in that range appears:
>>> cols = ["pear", "carrot", "grape", "berry", "apple"]
>>> val_counts = df[cols].apply(pd.Series.value_counts, normalize=True)
>>> val_counts
pear carrot grape berry apple
0.0 NaN NaN NaN 0.4 0.2
1.0 0.8 0.6 0.2 0.2 0.4
2.0 0.2 NaN 0.2 0.2 NaN
3.0 NaN 0.4 0.4 0.2 0.4
4.0 NaN NaN 0.2 NaN NaN
>>> result = val_counts.reindex(pd.RangeIndex(start=0, stop=6+1))
>>> result
pear carrot grape berry apple
0 NaN NaN NaN 0.4 0.2
1 0.8 0.6 0.2 0.2 0.4
2 0.2 NaN 0.2 0.2 NaN
3 NaN 0.4 0.4 0.2 0.4
4 NaN NaN 0.2 NaN NaN
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
cols
is columns to apply; it was written manually but can be chosen automatically dependingly. For example, for columns from 1st position to 6th or those columns that start with a certain name etc.
Upvotes: 3