Dimo
Dimo

Reputation: 116

Create columns based on word matches from seprate dataframe - works but very slow

I have two dataframes:

df = list of strings:

Strings
text xxx text
text yyy text
text www text
text eee text
text xxx yyyy text

tag_matrix_df = matrix of columns, labels and search words

Columns Search Labels
Col1 xxx label1
Col1 yyy label2
Col2 www label3
Col2 eee label3
Col2 xxx yyyy label4

I need to add columns to df based on tag_matrix_df when search words are found within the strings from df, for example: df =

Words Col1 Col2
text xxx text label1
text yyy text label2
text www text label3
text eee text label3
text xxx yyyy text label4

I have the following code, which seems to be working ok but it's very slow (tag_matrix_df > 5000 rows).

#df = Dataframe of words
#tag_matrix_df = datafreame of columns, labels and search strings
....

 def assign_label(kw, matrix_df):

    kw = kw.lower().strip()
    
    for ii, tag in matrix_df.iterrows():
      find_tag = tag['Find'].lower().strip()
      if kw != None and find_tag in kw.split():
        return tag['Label']

 flag_cols = tag_matrix_df['Flag Name'].unique()
 for flag in flag_cols:
   filtered_matrix_df = tag_matrix_df.loc[tag_matrix_df['Flag Name'] == flag]
   df[flag] = df.apply(lambda row: assign_label(row[0], filtered_matrix_df), axis=1)

Please also note the issue with the first and last row for tag_matrix_df where we have "xxx" within "xxx yyyy". In this instance, it should match to xxx yyyy as this is the word.

Any suggestions on a smarter implementation of this?

Upvotes: 0

Views: 54

Answers (1)

Corralien
Corralien

Reputation: 120479

Use pd.merge:

df = pd.merge(df, tag_matrix_df, left_on='Words', right_on='Search', how='left') 
df = df.pivot(index='Words', columns='Columns', values='Labels').fillna('')
>>> df
Columns     Col1    Col2
Words
eee               label3
www               label3
xxx       label1
xxx yyyy          label4
yyy       label2

Upvotes: 0

Related Questions