CPDatascience

Reputation: 103

Select the most frequent term from a list of strings, else return none?

I need to group by a column (here, the column Text) and collect all the strings from the column tag into a list for each group. Then I need to find the most frequent term in each list. If no single term is more frequent than the others (for example, when there is a tie), the function must return "none".

I have a dataset that looks like this:

 Text             tag
 drink coke       yes
 eat pizza        mic
 eat fruits       yes
 eat banana       yes
 eat banana       mic
 eat fruits       mic
 eat pizza        no
 eat pizza        mic
 eat pizza        yes
 drink coke       yes
 drink coke       no
 drink coke       no
 drink coke       yes

I used the code below to collect all the tags for each group into a list, which I want to store in a new column called labels. I'm missing the last step: selecting the most frequent term, or returning none if there is no single most frequent term.

 df = pd.DataFrame(df.groupby(['Text'])['tag'].apply(lambda x: list(x.values)))

I need to return this:

  Text           labels               final
  eat pizza      [mic,no,mic,yes]    mic
  eat fruits     [yes,mic]           none
  eat banana     [yes,mic]           none
  drink coke     [yes,yes,no,no,yes] yes
  

My output should be like the one in the column "final".

Upvotes: 1

Views: 79

Answers (3)

MAFiA303

Reputation: 1317

Here is a Pythonic way:

df.groupby(['Text'])['tag']\
.agg(lambda ser: ser.mode() if len(ser.mode()) == 1 else None)\
.reset_index()

         Text   tag
0  drink coke   yes
1  eat banana  None
2  eat fruits  None
3   eat pizza   mic
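
The length check works because `Series.mode()` returns every value tied for the highest count, not just one of them. A small sketch of that behaviour (the names `tied` and `clear` are made up for illustration and are not part of the answer above):

```python
import pandas as pd

# Two tags that each occur once: mode() returns both of them.
tied = pd.Series(["yes", "mic"])
tied_modes = tied.mode()

# One tag occurs more often than the rest: mode() returns a single value.
clear = pd.Series(["mic", "no", "mic", "yes"])
clear_modes = clear.mode()

print(len(tied_modes))       # 2 -> no unique most frequent tag
print(len(clear_modes))      # 1 -> a unique most frequent tag exists
print(clear_modes.iloc[0])   # mic
```

Taking `.iloc[0]` when the length is 1 gives a plain scalar rather than a one-element Series.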

Upvotes: 0

jezrael

Reputation: 863166

If performance is important, use statistics.multimode (available from Python 3.8): return the mode when there is exactly one, otherwise return None:

from statistics import multimode

def f_unique(x):
    a = multimode(x)
    return a[0] if len(a) == 1 else None

df1 = (df.groupby('Text', as_index=False, sort=False)
         .agg(labels = ('tag', list), final = ('tag', f_unique)))
print(df1)
         Text                   labels final
0   eat pizza      [mic, no, mic, yes]   mic
1  eat fruits               [yes, mic]  None
2  eat banana               [yes, mic]  None
3  drink coke  [yes, yes, no, no, yes]   yes
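
To see what `f_unique` does on its own, here is a small sketch that needs only the standard library. The sample lists are taken from the question's groups; the variable names are illustrative.

```python
from statistics import multimode

def f_unique(x):
    # multimode returns all values tied for the highest count,
    # in the order they were first encountered.
    modes = multimode(x)
    return modes[0] if len(modes) == 1 else None

pizza = ["mic", "no", "mic", "yes"]
fruits = ["yes", "mic"]

print(multimode(pizza))   # ['mic']
print(multimode(fruits))  # ['yes', 'mic']
print(f_unique(pizza))    # mic
print(f_unique(fruits))   # None
```

An empty input gives an empty list from `multimode`, so `f_unique` also returns None in that case.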

Upvotes: 0

mozway

Reputation: 261860

You can use groupby.agg with a custom function for the most frequent item:

def unique_mode(s):
    m = s.mode()
    if len(m) == 1:
        return m.iloc[0]
    return None
    
out = (df
   .groupby('Text', as_index=False)
   .agg(**{'labels': ('tag', list),
           'final': ('tag', unique_mode),
          })
)

output:

         Text                   labels final
0  drink coke  [yes, yes, no, no, yes]   yes
1  eat banana               [yes, mic]  None
2  eat fruits               [yes, mic]  None
3   eat pizza      [mic, no, mic, yes]   mic
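
For anyone who wants to reproduce this, below is a self-contained sketch that rebuilds the sample data from the question and applies the same approach. The `rows` list and the extra `final_str` column are additions for illustration: the question asked for the string "none", so the last step assumes that filling the missing values with "none" is what is wanted.

```python
import pandas as pd

# Sample data copied from the question, in the same row order.
rows = [
    ("drink coke", "yes"), ("eat pizza", "mic"), ("eat fruits", "yes"),
    ("eat banana", "yes"), ("eat banana", "mic"), ("eat fruits", "mic"),
    ("eat pizza", "no"), ("eat pizza", "mic"), ("eat pizza", "yes"),
    ("drink coke", "yes"), ("drink coke", "no"), ("drink coke", "no"),
    ("drink coke", "yes"),
]
df = pd.DataFrame(rows, columns=["Text", "tag"])

def unique_mode(s):
    # Return the mode only when it is unique; ties yield None.
    m = s.mode()
    return m.iloc[0] if len(m) == 1 else None

out = (df
   .groupby("Text", as_index=False)
   .agg(labels=("tag", list), final=("tag", unique_mode))
)

# Assumption: replace missing results with the literal string "none".
out["final_str"] = out["final"].fillna("none")
print(out)
```

Passing `sort=False` to `groupby` would keep the groups in order of first appearance instead of sorting them by `Text`.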

Upvotes: 2
