Reputation: 579
I have text stored in a DataFrame, and each entry contains many sentences. I have written a separate function that looks for certain keywords and values in a sentence, and I want to store those values in a different column of the same DataFrame. The problem arises when I iterate over the rows of the DataFrame and first tokenize each text into sentences.
The function works when I pass explicit sentences to it. The problem appears when I tokenize the text into sentences inside the loop: I get an empty result in rf["Nod_size"], whereas my expected result is "2.9x1.7" and "2.5x1.3".
This is the code I am using:
import pandas as pd
import numpy as np
import nltk
import re
from nltk.tokenize import TweetTokenizer, sent_tokenize, word_tokenize
rf = pd.DataFrame([{"Text": "CHEST CA lung. -Increased sizes of nodules in RLL. There is further increased size and solid component of part-solid nodule associated with internal bubbly lucency and pleural tagging at apicoposterior segment of the LUL (SE 3; IM 38-50), now measuring about 2.9x1.7 cm in greatest transaxial dimension (previously size 2.5x1.3 cm in 2015).", "Stage": "T2aN2M0"},
{"Text": "CHEST CA lung. Post LL lobectomy. As compared to study obtained on 30/10/2018, -Top normal heart size. -Increased sizes of nodules in RLL.", "Stage": "T2aN2M0"}])
nodule_keywords = ["nodules","nodule"]
nodule_length_keyword = ["cm","mm", "centimeters", "millimeters"]
def GetNodule(sentence):
    # Replace hyphens so that e.g. "part-solid" is tokenized as two words
    sentence = re.sub('-', ' ', sentence)
    token_words = nltk.word_tokenize(sentence)
    df = pd.DataFrame(token_words)
    # Flag tokens that are nodule keywords or length units
    df['check_nodkeywords'] = df[0].str.lower().isin(nodule_keywords)
    df['check_nod_len_keywords'] = df[0].str.lower().isin(nodule_length_keyword)
    check = np.any(df['check_nodkeywords']==True)
    check1 = np.any(df['check_nod_len_keywords']==True)
    if ((check==True)&(check1==True)):
        # The size is the token immediately before each length unit
        position = np.where(df['check_nod_len_keywords']==True)
        position = position[0]
        nodule_size = df[0].iloc[position-1]
        return nodule_size
    # Falls through and returns None when either keyword type is missing
for sub_list in rf['Text']:
    sent = sent_tokenize(str(sub_list))
    for sub_sent_list in sent:
        result_calcified_nod = GetNodule(sub_sent_list)
        rf["Nod_size"] = result_calcified_nod
Please help! I believe this is a conceptual problem rather than a programming one.
Upvotes: 0
Views: 539
Reputation: 1907
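The code below should meet your requirement. In your loop, rf["Nod_size"] = result_calcified_nod assigns to the whole column on every iteration, so only the result for the last sentence of the last row survives. That sentence contains no size, so GetNodule returns None. Collecting the results of each row in a list avoids this.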
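nod_sizes = []
for sub_list in rf['Text']:
    temp = []
    for sentence in sent_tokenize(str(sub_list)):
        result_calcified_nod = GetNodule(sentence)
        temp.append(result_calcified_nod)
    nod_sizes.append(temp)
# Assign the whole column once; chained indexing such as rf.loc[i]["Nod_size"] = temp
# may write to a temporary copy and leave rf unchanged.
rf["Nod_size"] = nod_sizes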
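Note that GetNodule returns a pandas Series (or None), so each per-row list holds Series objects rather than plain strings. If you only need strings such as "2.9x1.7", a regular expression may be simpler. The following is a minimal sketch under stated assumptions, not a drop-in replacement: split_sentences is a naive stand-in for sent_tokenize, and the names NODULE_KEYWORDS, LENGTH_UNITS, SIZE_PATTERN, split_sentences and nodule_sizes are hypothetical, not part of the original code.

```python
import re

# Hypothetical names, introduced only for this sketch.
NODULE_KEYWORDS = {"nodule", "nodules"}
LENGTH_UNITS = ("cm", "mm", "centimeters", "millimeters")

# A size such as "2.9x1.7" or "12", followed by one of the length units.
SIZE_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?(?:\s*x\s*\d+(?:\.\d+)?)*)\s*(?:"
    + "|".join(LENGTH_UNITS)
    + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize (an assumption of this sketch):
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def nodule_sizes(text):
    # Collect sizes from every sentence that mentions a nodule keyword.
    sizes = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & NODULE_KEYWORDS:
            sizes.extend(SIZE_PATTERN.findall(sentence))
    return sizes


text = ("Increased sizes of nodules in RLL. Part-solid nodule now measuring "
        "about 2.9x1.7 cm (previously size 2.5x1.3 cm in 2015).")
print(nodule_sizes(text))  # ['2.9x1.7', '2.5x1.3']
```

With the DataFrame from the question, this could be applied as rf["Nod_size"] = rf["Text"].apply(nodule_sizes), which stores one list of strings per row (an empty list when no size is found).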
Upvotes: 1