Mountain

Reputation: 33

Find a certain value in each row and count its frequency with pandas

I want to count how often a value occurs, separately for each row. For instance:

column_nameA | column_nameB | column_nameC | title | content
AAA company | AAA | Ben Simons | AAA company has new product launch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned.......
BBB company | BBB | Alex Wong | AAA company has new product launch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions…....

The result I expect: when AAA company appears once in the title, the count is 1; if it appears twice in the title, the count should be 2.

The same idea applies to the content: if AAA company appears once, the count is 1; if it appears twice in the content, the count should be 2.

However, each row should only be matched against its own names. If AAA company appears in the second row, it should not be counted, because that row only needs to consider BBB company or BBB, not AAA company or AAA.

So the result would look like this:

nameA_appear_in_title | nameB_appear_in_title | nameC_appear_in_title | nameA_appear_in_content | nameB_appear_in_content | nameC_appear_in_content
1 | 1 | 0 | 2 | 1 | 1
0 | 0 | 0 | 1 | 1 | 0

All the data is stored in a DataFrame, and I hope this can be done with pandas.

One more thing to highlight: the title and content cannot be tokenized to count the frequency.

Upvotes: 0

Views: 27

Answers (1)

jezrael

Reputation: 862501

Use itertools.product to iterate over all combinations of the name columns and the text columns, create a new column holding the count for each pair, and finally drop the original columns if necessary:

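from itertools import product

# remember the original columns so they can be dropped afterwards
cols = df.columns

L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

# count occurrences of each row's own name (b) in that row's text (a)
for a, b in product(L2, L1):
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)

df = df.drop(cols, axis=1)
print(df)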
   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   1                   0   
1                   0                   0                   0   

   column_nameA_content  column_nameB_content  column_nameC_content  
0                     2                     3                     1  
1                     1                     2                     0 
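To try this out end to end, here is a minimal self-contained sketch. It rebuilds the two sample rows from the question (the content strings are shortened, which is my assumption and not the asker's exact data) and writes the counts into a separate frame named counts, a name chosen only for this illustration. It also swaps df.apply for a zip-based list comprehension, which should give the same numbers:

```python
import pandas as pd
from itertools import product

# Sample rows reconstructed from the question; content strings are shortened (assumption)
df = pd.DataFrame({
    "column_nameA": ["AAA company", "BBB company"],
    "column_nameB": ["AAA", "BBB"],
    "column_nameC": ["Ben Simons", "Alex Wong"],
    "title": ["AAA company has new product launch."] * 2,
    "content": [
        "AAA company has released new product. AAA claims that the product X "
        "has changed. Ben Simons, who is AAA company CEO, also mentioned it.",
        "AAA company has released new product. BBB claims that the product X "
        "has changed, and BBB company has invested.",
    ],
})

name_cols = ["column_nameA", "column_nameB", "column_nameC"]
text_cols = ["title", "content"]

# Keep the counts in their own frame so the original columns stay untouched
counts = pd.DataFrame(index=df.index)
for text_col, name_col in product(text_cols, name_cols):
    # str.count counts non-overlapping substring matches, so no tokenizing is needed
    counts[f"{name_col}_{text_col}"] = [
        text.count(name) for text, name in zip(df[text_col], df[name_col])
    ]

print(counts)
```

Because each row's text is only compared with that row's own names, AAA company in the second row's title is ignored, as the question requires.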

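Finally, if necessary, subtract the column_nameA counts from the column_nameB counts, so that occurrences of the short name inside the full name (AAA inside AAA company) are not counted twice: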

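# boolean masks selecting the count columns for each name
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')

# to_numpy() subtracts by position instead of aligning on the differing column labels
df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print(df)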
   column_nameA_title  column_nameB_title  column_nameC_title  \
0                   1                   0                   0   
1                   0                   0                   0   

   column_nameA_content  column_nameB_content  column_nameC_content  
0                     2                     1                     1  
1                     1                     1                     0  
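The positional subtraction relies on the column_nameA and column_nameB columns appearing in the same order. As a hedged alternative sketch, the columns can be paired by their shared suffix instead. The helper subtract_nested below is hypothetical (it is not a pandas function), and the input frame simply hard-codes the counts printed after the first step:

```python
import pandas as pd

# Counts as printed after the first step (before any subtraction)
counts = pd.DataFrame({
    "column_nameA_title": [1, 0],
    "column_nameB_title": [1, 0],
    "column_nameC_title": [0, 0],
    "column_nameA_content": [2, 1],
    "column_nameB_content": [3, 2],
    "column_nameC_content": [1, 0],
})


def subtract_nested(counts, full="column_nameA", short="column_nameB"):
    """Hypothetical helper: remove matches of the short name that fall inside the full name."""
    out = counts.copy()
    for col in out.columns:
        if col.startswith(short + "_"):
            suffix = col[len(short):]  # e.g. "_title" or "_content"
            # Pair each short-name column with the full-name column sharing its suffix
            out[col] = out[col] - out[full + suffix]
    return out


standalone = subtract_nested(counts)
print(standalone)
```

Returning a copy leaves the raw counts available, which helps if both the raw and the adjusted numbers are needed later.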

Upvotes: 1
