tanmay

Reputation: 225

How to perform word level search in a python string

I have a pandas DataFrame with the structure below:

 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   annotation            10237 non-null  object
 1   note_sentence         10237 non-null  object
 2   listofsentence        10237 non-null  object

The last column, 'listofsentence', holds a list initialised with one 'O' per word in the 'note_sentence' column. I want to match each string in the 'annotation' column against the long string in 'note_sentence' and, wherever a word matches, update the corresponding 'listofsentence' entry from 'O' to 'I'. For example, in the sample record below, 'listofsentence' should be updated from its default state to [O, I, I, I, O, I, I, I, O, O, O].

I used the code below, which returns the start and end character indexes of each match, but I want the positions at the word level.

import re

def find_index(string, sentence):
    # Print the character offsets of each literal occurrence of `string`.
    for match in re.finditer(re.escape(string), sentence):
        print(match.start(), match.end())

How can I do this?

[Image of the sample record]

Upvotes: 1

Views: 47

Answers (1)

Laurent

Reputation: 13518

With the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "annotation": [["COMES AND GOES", "HAPPENED 5-6 TIMES"]],
        "note_sentence": ["IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE IT STARTED"],
    }
)

You can define a helper function:

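def match(annotation, note_sentence):
    # Drop commas so that "GOES," is tokenized as "GOES", then split into words.
    words = note_sentence.replace(",", "").split(" ")
    indices = []
    for item in annotation:
        # Only label phrases that actually occur in the sentence.
        if item in " ".join(words):
            for word in item.split(" "):
                # list.index returns the position of the first occurrence of the word.
                indices.append(words.index(word))
    return ["I" if i in indices else "O" for i in range(len(words))]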

And then:

df["list_of_sentence"] = df.apply(
    lambda x: match(x["annotation"], x["note_sentence"]), axis=1
)

print(df)
# Output
                             annotation  \
0  [COMES AND GOES, HAPPENED 5-6 TIMES]   

                                       note_sentence  \
0  IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE...   

                    list_of_sentence  
0  [O, I, I, I, O, I, I, I, O, O, O] 
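
One caveat: because `list.index` returns the first occurrence of a word, an annotation that contains a word appearing earlier in the sentence (for example "IT STARTED" here) could be labelled at the wrong position. A minimal sketch of a variant that compares whole phrases at the word level is shown below. The name `label_words` and the set of punctuation characters stripped are assumptions for illustration, not part of the original code.

```python
def label_words(annotations, sentence):
    """Return one "I"/"O" label per whitespace-separated word of `sentence`."""
    # Split on whitespace and strip surrounding punctuation from each word,
    # so that "GOES," compares equal to "GOES" (assumed punctuation set).
    words = [w.strip(",.;:!?") for w in sentence.split()]
    labels = ["O"] * len(words)
    for phrase in annotations:
        target = phrase.split()
        n = len(target)
        if n == 0:
            continue
        # Slide a window of n words across the sentence and mark every
        # position where the whole phrase matches word for word.
        for start in range(len(words) - n + 1):
            if words[start:start + n] == target:
                labels[start:start + n] = ["I"] * n
    return labels


sentence = "IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE IT STARTED"
print(label_words(["COMES AND GOES", "HAPPENED 5-6 TIMES"], sentence))
# ['O', 'I', 'I', 'I', 'O', 'I', 'I', 'I', 'O', 'O', 'O']
print(label_words(["IT STARTED"], sentence))
# ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'I', 'I']
```

It can be applied to the dataframe in the same way, with `df.apply(lambda x: label_words(x["annotation"], x["note_sentence"]), axis=1)`.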

Upvotes: 1
