Reputation: 33
I want to count, for each row, how often the names stored in that row appear in the row's title and content. For instance:
column_nameA | column_nameB | column_nameC | title | content |
---|---|---|---|---|
AAA company | AAA | Ben Simons | AAA company has new product lanuch. | AAA company has released new product. AAA claims that the product X has significant changed than before. Ben Simons, who is AAA company CEO, also mentioned....... |
BBB company | BBB | Alex Wong | AAA company has new product lanuch. | AAA company has released new product. BBB claims that the product X has significant changed than before, and BBB company has invested around 1 millions….... |
The result I expect is this: if AAA company appears once in the title, the count is 1; if it appears twice in the title, the count is 2.
The same idea applies to the content: if AAA company appears once in the content, the count is 1; if it appears twice in the content, the count is 2.
However, each row should only be matched against its own names. In the second row, only BBB company and BBB are counted, not AAA company or AAA, even though AAA company appears in that row's text.
So the expected result would look like this:
nameA_appear_in_title | nameB_appear_in_title | nameC_appear_in_title | nameA_appear_in_content | nameB_appear_in_content | nameC_appear_in_content |
---|---|---|---|---|---|
1 | 1 | 0 | 2 | 1 | 1 |
0 | 0 | 0 | 1 | 1 | 0 |
All the data is stored in a DataFrame, and I hope this can be done with pandas.
One more thing to highlight: the title and content cannot be tokenized in order to count the frequency.
Upvotes: 0
Views: 27
Reputation: 862501
Use itertools.product for all combinations of the two lists of column names and create a new count column for each pair. Finally, remove the original columns if necessary:
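from itertools import product

# remember the original columns so they can be dropped afterwards
cols = df.columns
L1 = ['column_nameA', 'column_nameB', 'column_nameC']
L2 = ['title', 'content']

# for each (text column, name column) pair, count the name's occurrences in the text, row by row
for a, b in product(L2, L1):
    df[f'{b}_{a}'] = df.apply(lambda x: x[a].count(x[b]), axis=1)

# keep only the new count columns
df = df.drop(cols, axis=1)
print(df)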
column_nameA_title column_nameB_title column_nameC_title \
0 1 1 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 3 1
1 1 2 0
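To try the counting step in isolation, here is a minimal self-contained sketch. The sample strings are an abbreviated copy of the rows in the question (the trailing ellipses are dropped), and the names `name_cols`, `text_cols` and `counts` are only illustrative; the counts go into a separate frame instead of being added to `df` and dropped later.

```python
from itertools import product

import pandas as pd

# Abbreviated copy of the question's sample rows (assumed text, ellipses dropped).
df = pd.DataFrame({
    "column_nameA": ["AAA company", "BBB company"],
    "column_nameB": ["AAA", "BBB"],
    "column_nameC": ["Ben Simons", "Alex Wong"],
    "title": [
        "AAA company has new product lanuch.",
        "AAA company has new product lanuch.",
    ],
    "content": [
        "AAA company has released new product. AAA claims that the product X "
        "has significant changed than before. Ben Simons, who is AAA company CEO, "
        "also mentioned",
        "AAA company has released new product. BBB claims that the product X "
        "has significant changed than before, and BBB company has invested "
        "around 1 millions",
    ],
})

name_cols = ["column_nameA", "column_nameB", "column_nameC"]  # illustrative names
text_cols = ["title", "content"]

# Collect the counts in a new frame so the original columns stay untouched.
counts = pd.DataFrame(index=df.index)
for text_col, name_col in product(text_cols, name_cols):
    # str.count counts non-overlapping substring matches, so "AAA"
    # is also counted inside "AAA company".
    counts[f"{name_col}_{text_col}"] = df.apply(
        lambda row: row[text_col].count(row[name_col]), axis=1
    )

print(counts)
```

Because `str.count` matches plain substrings, the `column_nameB` counts here still include the occurrences that are part of the longer `column_nameA` string.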
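Finally, if necessary, subtract the column_nameA counts from the column_nameB counts, so that occurrences of column_nameB inside column_nameA are not counted twice: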
cola = df.columns.str.startswith('column_nameA')
colb = df.columns.str.startswith('column_nameB')
df.loc[:, colb] = df.loc[:, colb] - df.loc[:, cola].to_numpy()
print (df)
column_nameA_title column_nameB_title column_nameC_title \
0 1 0 0
1 0 0 0
column_nameA_content column_nameB_content column_nameC_content
0 2 1 1
1 1 1 0
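The subtraction step can be checked on its own with the sketch below. It starts from the counts printed after the first step and assumes, as in the sample data, that every `column_nameA` value contains the `column_nameB` value exactly once; if that does not hold for other rows, the subtraction would give wrong numbers. The variable names `counts`, `is_a` and `is_b` are illustrative.

```python
import pandas as pd

# Substring counts as printed after the first step.
counts = pd.DataFrame({
    "column_nameA_title": [1, 0],
    "column_nameB_title": [1, 0],
    "column_nameC_title": [0, 0],
    "column_nameA_content": [2, 1],
    "column_nameB_content": [3, 2],
    "column_nameC_content": [1, 0],
})

# Boolean masks selecting the A and B count columns.
is_a = counts.columns.str.startswith("column_nameA")
is_b = counts.columns.str.startswith("column_nameB")

# to_numpy() drops the column labels, so the subtraction is positional
# (A_title from B_title, A_content from B_content) instead of label-aligned.
counts.loc[:, is_b] = counts.loc[:, is_b] - counts.loc[:, is_a].to_numpy()

print(counts)
```

Without `to_numpy()`, pandas would try to align the differently named A and B columns and produce missing values rather than the intended differences.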
Upvotes: 1