Reputation: 49
I am new to pandas, so please bear with me. I want to apply mode on a DataFrame, but with a tweak. The columns Item_1, Item_2 and Item_3 can each take four values [High, Medium, Low, NA], with High having the highest priority and Low the lowest. NA should be ignored.
Data is as shown below.
date Type Item_1 Item_2 Item_3 Price
0 2021-01-01 A High Low Low 20
1 2021-01-01 A High Low Low 20
2 2021-01-01 A Low High Low 20
3 2021-01-01 A Medium High High 20
4 2021-01-01 B Low High Low 15
5 2021-01-01 B Medium High High 15
6 2021-01-01 B Low Low Medium 15
7 2021-01-02 A NA High NA 30
8 2021-01-02 A NA High NA 30
9 2021-01-02 A NA NA NA 30
10 2021-01-02 A NA NA Low 30
11 2021-01-02 A NA NA Low 30
12 2021-01-02 A NA Low High 30
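For anyone who wants to reproduce this, the sample data above can be reconstructed as a DataFrame (column names exactly as in the question; "NA" is a literal string, not a missing value):

```python
import pandas as pd

# Rebuild the sample data shown above; "NA" is a plain string here
df = pd.DataFrame({
    "date": ["2021-01-01"] * 7 + ["2021-01-02"] * 6,
    "Type": ["A"] * 4 + ["B"] * 3 + ["A"] * 6,
    "Item_1": ["High", "High", "Low", "Medium", "Low", "Medium", "Low",
               "NA", "NA", "NA", "NA", "NA", "NA"],
    "Item_2": ["Low", "Low", "High", "High", "High", "High", "Low",
               "High", "High", "NA", "NA", "NA", "Low"],
    "Item_3": ["Low", "Low", "Low", "High", "Low", "High", "Medium",
               "NA", "NA", "NA", "Low", "Low", "High"],
    "Price": [20, 20, 20, 20, 15, 15, 15, 30, 30, 30, 30, 30, 30],
})
```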
Expected Output:
date Type Item_1 Item_2 Item_3 Price
0 2021-01-01 A High High Low 20
1 2021-01-01 B Low High High 15
2 2021-01-02 A NA High Low 30
My use case: as you can see, in the first output row Item_1 now has the value High because it is the most frequent.
In Item_2, Low and High have the same frequency, but High has higher priority, so the output value is High.
In Item_3, Low is the most frequent, so the value is Low.
Edit:
NA should be ignored, and the second most frequent value should be the output: for date 2021-01-02, NA has the highest frequency, but the second most frequent value is High.
The value NA does not correspond to NaN; rather, it is a string object, just like the other categorical values. Whenever there is even a single High/Low/Medium for a date, irrespective of the frequency of NA, I want to choose the categorical value with the second highest occurrence/frequency.
Upvotes: 1
Views: 444
Reputation: 862911
I think there are multiple rows with different dates, so first reshape by DataFrame.melt, then convert the values to ordered categoricals, get the first mode per group in GroupBy.agg with a lambda function, and last reshape back by Series.unstack:
import pandas as pd

cols = df.columns
df = df.melt(['date','Type','Price'])
# "NA" is not listed in categories, so it becomes NaN and Series.mode ignores it
df["value"] = pd.Categorical(df["value"],
                             categories=["High", "Medium", "Low"],
                             ordered=True)
# fall back to "NA" for groups with only missing values, else mode() is empty
df = (df.groupby(['date','Type','Price','variable'])['value']
        .agg(lambda x: x.mode().iat[0] if not x.mode().empty else "NA")
        .unstack()
        .rename_axis(None, axis=1)
        .reset_index()
        .reindex(cols, axis=1))
print (df)
date Type Item_1 Item_2 Item_3 Price
0 2021-01-01 A High High Low 20
1 2021-01-01 B Low High High 15
2 2021-01-02 A NA High Low 30
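Why the literal "NA" strings drop out: since "NA" is not in the categories list, pd.Categorical maps it to NaN, and Series.mode() skips NaN by default. A minimal check with made-up values:

```python
import pandas as pd

# "NA" is absent from categories, so those entries become NaN
s = pd.Series(pd.Categorical(["NA", "High", "NA", "Low", "High"],
                             categories=["High", "Medium", "Low"],
                             ordered=True))
print(s.isna().sum())  # 2 -- both "NA" strings became NaN
print(s.mode())        # NaN ignored, so High (count 2) is the single mode
```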
Upvotes: 1
Reputation: 1204
You could turn your Item columns into categoricals and provide the ranking (your priority). For example, for Item_2:
df["Item_2"] = pd.Categorical(df["Item_2"], ["High", "Medium", "Low"])
When you use mode, it returns multiple values if more than one value has the highest frequency. But it will also order the results: by default in alphabetical order, but because you turned your column into a categorical, it will use the ranking instead.
df["Item_2"].mode()
Will return
0 High
1 Low
Name: Item_2, dtype: category
Using the [0] index you can always extract the top value.
df["Item_2"].mode()[0]
If you apply that to all columns you should get what you are looking for.
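Putting that idea into a per-group aggregation could look like this sketch on a toy slice of the data; first_mode is a helper name introduced here (not from the answer), which falls back to the string "NA" when a group has no non-missing values:

```python
import pandas as pd

# Toy slice: one date/Type group with a literal "NA" entry
df = pd.DataFrame({
    "date": ["2021-01-02"] * 3,
    "Type": ["A"] * 3,
    "Item_2": ["High", "NA", "Low"],
})

# Provide the ranking; "NA" is not a category, so it becomes NaN
df["Item_2"] = pd.Categorical(df["Item_2"], ["High", "Medium", "Low"])

def first_mode(s):
    m = s.mode()  # ties come back sorted by the categorical ranking
    return m.iat[0] if not m.empty else "NA"

out = df.groupby(["date", "Type"])["Item_2"].agg(first_mode).reset_index()
print(out)  # High and Low tie, but High ranks first
```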
Upvotes: 1