Reputation: 225
I have a pandas DataFrame with the following structure:
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   annotation      10237 non-null  object
 1   note_sentence   10237 non-null  object
 2   listofsentence  10237 non-null  object
The last column, 'listofsentence', is a list initialised with one 'O' per word in the 'note_sentence' column. I want to match each string in the 'annotation' column against the long string in the 'note_sentence' column and, wherever a word matches, change the corresponding 'listofsentence' value from 'O' to 'I'. For example, in the below sample record the value of 'listofsentence' should be updated from its default state to [O, I, I, I, O, I, I, I, O, O, O].
I used the code below, which returns the start and end character indexes of each match, but I want the positions at the word level.
import re

def find_index(string, sentence):
    # Print the character offsets of every match of `string` in `sentence`
    for match in re.finditer(string, sentence):
        print(match.start(), match.end())
How can I do this?
Upvotes: 1
Views: 47
Reputation: 13518
With the following dataframe:
import pandas as pd
df = pd.DataFrame(
    {
        "annotation": [["COMES AND GOES", "HAPPENED 5-6 TIMES"]],
        "note_sentence": ["IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE IT STARTED"],
    }
)
You can define a helper function:
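def match(annotation, note_sentence):
    # Drop commas so that "GOES," compares equal to "GOES"
    note_sentence = note_sentence.replace(",", "")
    indices = []
    for item in annotation:
        # Only consider annotation phrases that occur in the sentence
        if item in note_sentence:
            for word in item.split(" "):
                # list.index returns the first occurrence, so a word that also
                # appears earlier in the sentence is tagged at that earlier position
                indices.append(note_sentence.split(" ").index(word))
    # One tag per word: "I" for matched positions, "O" otherwise
    return [
        "O" if i not in indices else "I" for i in range(len(note_sentence.split(" ")))
    ]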
And then:
df["list_of_sentence"] = df.apply(
    lambda x: match(x["annotation"], x["note_sentence"]), axis=1
)
print(df)
# Output
annotation \
0 [COMES AND GOES, HAPPENED 5-6 TIMES]
note_sentence \
0 IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE...
list_of_sentence
0 [O, I, I, I, O, I, I, I, O, O, O]
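If a word from an annotation can also appear elsewhere in the sentence (for example "IT"), looking up each word separately may tag the wrong position. A possible variant, sketched below, compares each annotation as a whole run of consecutive words instead. The names `tokenize` and `tag_words` are hypothetical, and the punctuation stripped in `tokenize` is an assumption about the data, so adjust it to your notes.

```python
def tokenize(text):
    # Split on whitespace and strip common punctuation from the ends of each
    # token, so "GOES," becomes "GOES" while "5-6" is left untouched.
    # (The punctuation set is an assumption about the data.)
    return [token.strip(",.;:!?") for token in text.split()]


def tag_words(annotations, note_sentence):
    words = tokenize(note_sentence)
    tags = ["O"] * len(words)
    for phrase in annotations:
        target = tokenize(phrase)
        n = len(target)
        if n == 0:
            continue
        # Slide a window of n words over the sentence and tag every
        # position where the window equals the annotation's words.
        for start in range(len(words) - n + 1):
            if words[start:start + n] == target:
                tags[start:start + n] = ["I"] * n
    return tags


sentence = "IT COMES AND GOES, IT HAPPENED 5-6 TIMES SINCE IT STARTED"
print(tag_words(["COMES AND GOES", "HAPPENED 5-6 TIMES"], sentence))
# ['O', 'I', 'I', 'I', 'O', 'I', 'I', 'I', 'O', 'O', 'O']
```

It can be applied row by row in the same way as the helper above, by passing `tag_words` to `df.apply` in place of `match`.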
Upvotes: 1