user8376557
user8376557

Reputation:

Finding most frequent value from dataframe rows in Pandas

In a data frame, I want to create another column which is outputs the most frequent value coming from different columns in a row.

A    B    C   D
foo  bar  baz foo
egg  bacon egg egg
bacon egg foo  baz

The "E" column must output frequent value from a row like

E
foo
egg

How can I do it in Python?

Upvotes: 2

Views: 1581

Answers (1)

Oddaspa
Oddaspa

Reputation: 888

Recreating your problem with:

df = pd.DataFrame(
    {
        'A' : ['foo', 'egg', 'bacon'], 
        'B' : ['bar', 'bacon', 'egg'],
        'C' : ['baz', 'egg', 'foo'],
        'D' : ['foo', 'egg', 'baz']
    }
)

And solving the problem with

df['E'] = df.mode(axis=1)[0]

Output:

    A      B       C       D       E
0   foo    bar     baz     foo     foo
1   egg    bacon   egg     egg     egg
2   bacon  egg     foo     baz     bacon

What happens if there is no single most frequent element?

df.mode(axis=1)
    0      1       2       3
0   foo    NaN     NaN     NaN
1   egg    NaN     NaN     NaN
2   bacon  baz     egg     foo

As you can see when there is a tie on being most frequent it returns the values in the most frequent set. If I swap the values foo for egg and baz for bacon in columns C and D, respectively, we get the following result:

    0      1
0   foo    NaN
1   egg    NaN
2   bacon  egg

As you can see, now the result set is only two elements, which means that the tie is between bacon and egg.

How do I detect ties?

Let us work with the dataset not containing the column D.

df
    A      B       C
0   foo    bar     baz
1   egg    bacon   egg
2   bacon  egg     foo

df_m = df.mode(axis=1)
df_m
    0      1    2
0   bar    baz  foo
1   egg    NaN  NaN
2   bacon  egg  foo

df['D'] = df_m[0]
    A      B       C    D
0   foo    bar     baz  bar
1   egg    bacon   egg  egg
2   bacon  egg     foo  bacon

We can utilize the notna() method which pandas provide to create a mask to check which rows are not containing a NaN value, i.e which rows are in a tie.

First, we must drop the first column which always has a value.

df_m = df_m.drop(columns=0)

Then we need to transform the dataframe using another method .T, and check for any rows not containing NaNs.

df_mask = df_m.T.notna().any()
df_mask
0    False
1    False
2     True
dtype: bool

Now we have a pandas series of booleans. We can use this mask to overwrite the column from before.

df['D'][df_mask] = df['A'][df_mask] 
    A      B       C    D
0   foo    bar     baz  foo
1   egg    bacon   egg  egg
2   bacon  egg     foo  bacon

Upvotes: 3

Related Questions