jon
jon

Reputation: 359

New dataframe that is the result of value_counts on multiple columns in old dataframe

I have a dataframe where the data looks like this.

        pack  pear  carrot  grape  berry  apple  sum 3rd pack  box
0      1   1.0     3.0    1.0    2.0    1.0    8      NaN  1.0
1      2   1.0     1.0    3.0    0.0    3.0    8      NaN  1.0
2      3   1.0     1.0    3.0    1.0    3.0    9     True  1.0
3      4   1.0     1.0    2.0    3.0    1.0    8      NaN  2.0
4      5   2.0     3.0    4.0    0.0    0.0    9      NaN  2.0

The 5 columns with the fruit names only have values between 0 and 6 and I'd like to have a data frame that shows the distribution of those individual values. It should like exactly like this

         pear    carrot     grape     berry     apple
0.0  0.163636  0.145455  0.090909  0.218182  0.236364
1.0  0.472727  0.200000  0.218182  0.418182  0.454545
2.0  0.218182  0.309091  0.290909  0.254545  0.145455
3.0  0.090909  0.272727  0.236364  0.109091  0.127273
4.0  0.036364  0.072727  0.090909       NaN  0.036364
5.0  0.018182       NaN  0.036364       NaN       NaN
6.0       NaN       NaN  0.036364       NaN       NaN 

I did this by creating a series for each fruit and then merging the series together

pears_per_pack = df['pear'].value_counts(normalize=True)
carrots_per_pack = df['carrot'].value_counts(normalize=True)
grapes_per_pack = df['grape'].value_counts(normalize=True)
berrys_per_pack = df['berry'].value_counts(normalize=True)
apples_per_pack = df['apple'].value_counts(normalize=True)

df_lst = [pears_per_pack,carrots_per_pack,grapes_per_pack,berrys_per_pack,apples_per_pack]
fruit_df = pd.concat(df_lst, axis = 1)

Now, this was only 5 columns, fairly easy to do by hand, but this whole thing is more or less a learning opportunity for me, and I figure this must break DRY protocol. SO I'm asking if there is a better way to do something like this that is more proper/ would be more applicable if it were a greater number of columns I needed to do this with.

Upvotes: 2

Views: 171

Answers (1)

Mustafa Aydın
Mustafa Aydın

Reputation: 18306

You can invoke pd.Series.value_counts per column with apply and then reindex the result to 0..6 to assure every value in that range appears:

>>> cols = ["pear", "carrot", "grape", "berry", "apple"]

>>> val_counts = df[cols].apply(pd.Series.value_counts, normalize=True)
>>> val_counts

     pear  carrot  grape  berry  apple
0.0   NaN     NaN    NaN    0.4    0.2
1.0   0.8     0.6    0.2    0.2    0.4
2.0   0.2     NaN    0.2    0.2    NaN
3.0   NaN     0.4    0.4    0.2    0.4
4.0   NaN     NaN    0.2    NaN    NaN

>>> result = val_counts.reindex(pd.RangeIndex(start=0, stop=6+1))
>>> result

   pear  carrot  grape  berry  apple
0   NaN     NaN    NaN    0.4    0.2
1   0.8     0.6    0.2    0.2    0.4
2   0.2     NaN    0.2    0.2    NaN
3   NaN     0.4    0.4    0.2    0.4
4   NaN     NaN    0.2    NaN    NaN
5   NaN     NaN    NaN    NaN    NaN
6   NaN     NaN    NaN    NaN    NaN

cols is columns to apply; it was written manually but can be chosen automatically dependingly. For example, for columns from 1st position to 6th or those columns that start with a certain name etc.

Upvotes: 3

Related Questions