TK7372
TK7372

Reputation: 49

Mode of a dataframe with priority

I am new to pandas so please bear with me. I want to apply mode on a dataframe but with a tweak. there can be four values [High,Medium,Low,NA] for Columns Item_1, Item_2 and Item_3. High with highest priority and Low being lowest. NA should be ignored.
Data is as shown below.

    date       Type  Item_1   Item_2  Item_3 Price
0   2021-01-01  A    High     Low     Low     20   
1   2021-01-01  A    High     Low     Low     20    
2   2021-01-01  A    Low      High    Low     20   
3   2021-01-01  A    Medium   High    High    20    
4   2021-01-01  B    Low      High    Low     15   
5   2021-01-01  B    Medium   High    High    15
6   2021-01-01  B    Low      Low     Medium  15
6   2021-01-02  A    NA       High    NA      30    
7   2021-01-02  A    NA       High    NA      30    
8   2021-01-02  A    NA       NA      NA      30
9   2021-01-02  A    NA       NA      Low     30
10  2021-01-02  A    NA       NA      Low     30
11  2021-01-02  A    NA       Low     High    30

Expected Output:

    date       Type  Item_1   Item_2  Item_3  Price
0   2021-01-01  A    High     High    Low     20  
1   2021-01-01  B    Low      High    High    15
2   2021-01-02  A    NA       High    Low     30   

My use case is, As you can see for 1st row in Item_1 now has value High as its most frequent.
But in Item_2 Both Low and High have same frequency but High have more priority thus value is High in output.
In Item_3 Low is most frequent thus value Low.

Edit:
NA should be ignored. and second most availbe value should be as output as in case of date 2021-01-02 NA has highest frequency but second most frequent value is High.

The value NA does not correspond to NaN rather its a string object. similar to other categorical values. It's just that whenever there is even single High/Low/Medium for a date irrespective of the frequency of NA, I want to choose the categorical value which has second highest occurance/frequency

Upvotes: 1

Views: 444

Answers (2)

jezrael
jezrael

Reputation: 862911

I think there is multiple rows with different dates, so first reshape by DataFrame.melt, then convert values to ordered categoricals and last get first mode per groups in GroupBy.agg in lambda function and last reshape by Series.unstack:

cols = df.columns
df = df.melt(['date','Type','Price'])
df["value"] = pd.Categorical(df["value"], 
                             categories=["High", "Medium", "Low"], 
                             ordered=True)
df = (df.groupby(['date','Type','Price','variable'])['value']
        .agg(lambda x: x.mode().iat[0])
        .unstack()
        .rename_axis(None, axis=1)
        .reset_index()
        .reindex(cols, axis=1))
print (df)
         date Type  Item_1 Item_2 Item_3  Price
0  2021-01-01    A    High   High    Low     20
1  2021-01-01    B     Low   High   High     15
2  2021-01-02    A  Medium    Low   High     30

Upvotes: 1

elevendollar
elevendollar

Reputation: 1204

You could turn your Item columns into categorical and provide the ranking (your priority). For example for Item_2:

df["Item_2"] = pd.Categorical(df["Item_2"], ["High", "Medium", "Low"])

When you use mode it returns multiple values if more than 1 value has the highest frequency. But it will also order the results. By default in alphab etical order but because you turned your column into categorical it will use the ranking instead.

df["Item_2"].mode()

Will return

0    High
1     Low
Name: Item_2, dtype: category

Using [0] index you can always extract the top value.

df["Item_2"].mode()[0]

If you apply that to all columns you should get what you are looking for.

Upvotes: 1

Related Questions