Reputation: 141
A pandas DataFrame of mostly structured data has two columns containing user input: free-text narratives. Some narratives are poorly written. I'm looking to extract keywords that occur in the same sentence within each narrative. The keywords are sometimes adjacent bigrams (fractured implant), but usually several non-keywords sit between them (implant was really fractured). Keywords only count as a pair if they occur in the same sentence of the narrative, and a sentence can contain more than two keywords. Here's an example, plus my attempt.
import pandas as pd
import nltk
def get_keywords(x, y):
    tokens = nltk.tokenize.word_tokenize(x)
    keywords = [keyword for keyword in tokens if keyword in y]
    keywords_string = ', '.join(keywords)
    return keywords_string
text = ['after investigation it was found that plate was fractured. It was a broken plate. patient had fractured his femur. ',
        'investigation took long. upon xray the plate, which looked ok at first suffered breakage.',
        'it happend that the screws had all broken', 'it was sad. fractured was the implant.',
        'this sentance has nothing. as does this one. and this one too.',
        'nothing happening here though a bone was fractured. bone was broke too as was screw.']
df = pd.DataFrame(text, columns = ['Text'])
## These are the key words. The pairs belong to separate lists--(items, modes) in
## either order. These lists tend to grow as more keywords are discovered.
items = ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']
modes = ['broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured']
other = ['bone', 'femur', 'ulna']
# the apply(lambda) is slow but I don't mind it.
df['items'] = df['Text'].apply(lambda x: get_keywords(x, items))
df['F Modes'] = df['Text'].apply(lambda x: get_keywords(x, modes))
df['other'] = df['Text'].apply(lambda x: get_keywords(x, other))
### After using loc to isolate rows of interest, go back and grab whole
## sentence for review. It's shorter than reading everything. But this
## is what I'm hoping to reduce.
xxx = df['Text'].str.extractall(r"([^.]*?fracture[^.]*\.)").unstack()
This takes a lot of effort and iteration. Pulling only the sentences that contain keywords is less work than reading everything, but it's still a lot of work. QUESTION: Is it possible to look within each sentence, grab only the words of interest, keep them in order, and place them as groups in a summary column, dropping all the words in between? Indices have to be preserved because this text data will be merged with another df on the index.
The desired df would look like this:
text = [['after investigation it was found that plate was fractured. It was a broken plate. patient had fractured his femur. ', 'plate fractured, broken plate, fractured femur'],
        ['investigation took long. upon xray the plate, which looked ok at first suffered breakage.', 'plate breakage'],
        ['it happened that the screws had all broken', 'screws broken'],
        ['it was sad. fractured was the implant.', 'fractured implant'],
        ['this sentence has nothing. as does this one. and this one too.', ''],
        ['nothing happening here. though a bone was fractured. bone was broke too as was screw.', 'bone fractured, bone broke screw']]
df = pd.DataFrame(text, columns = ['Text', 'Summary'])
df
Upvotes: 1
Views: 2315
Reputation: 26708
You could try tokenizing the text before extracting the keywords:
import pandas as pd
import nltk
import numpy as np
from more_itertools import split_after
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead
text = ['after investigation it was found that plate was fractured. It was a broken plate. patient had fractured his femur. ',
'investigation took long. upon xray the plate, which looked ok at first suffered breakage.',
'it happend that the screws had all broken', 'it was sad. fractured was the implant.',
'this sentance has nothing. as does this one. and this one too.',
'nothing happening here though a bone was fractured. bone was broke too as was screw.']
def tokenize(texts):
    return [nltk.tokenize.word_tokenize(t) for t in texts]
Afterwards, you can extract the keywords into a new column (here I extract the keywords from each sentence):
def key_word_intersection(df):
    summaries = []
    for x in tokenize(df['Text'].to_numpy()):
        # Keywords that actually occur in this row's tokens
        keywords = np.concatenate([
            np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
            np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured']),
            np.intersect1d(x, ['bone', 'femur', 'ulna'])])
        # Split the token list into sentences, ending each one at a "." token
        dot_sep_sentences = np.array(list(split_after(x, lambda i: i == ".")), dtype=object)
        summary = []
        for i, s in enumerate(dot_sep_sentences):
            # Keep only the keywords of this sentence, in their original order
            summary.append([dot_sep_sentences[i][j] for j, keyword in enumerate(s) if keyword in keywords])
        # Join words within a sentence with spaces and sentences with commas
        summaries.append(', '.join([' '.join(x) for x in summary if x]))
    return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
| | Text | Summary |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------|
| 0 | after investigation it was found that plate was fractured. It was a broken plate. patient had fractured his femur. | plate fractured, broken plate, fractured femur |
| 1 | investigation took long. upon xray the plate, which looked ok at first suffered breakage. | plate breakage |
| 2 | it happend that the screws had all broken | screws broken |
| 3 | it was sad. fractured was the implant. | fractured implant |
| 4 | this sentance has nothing. as does this one. and this one too. | |
| 5 | nothing happening here though a bone was fractured. bone was broke too as was screw. | bone fractured, bone broke screw |
If you do not want the keywords separated by sentence but still want to maintain their order, you could just do:
def key_word_intersection(df):
    summaries = []
    for x in tokenize(df['Text'].to_numpy()):
        keywords = np.concatenate([
            np.intersect1d(x, ['implant', 'implants', 'plate', 'plates', 'screw', 'screws']),
            np.intersect1d(x, ['broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured']),
            np.intersect1d(x, ['bone', 'femur', 'ulna'])])
        summaries.append(np.array(x)[[i for i, keyword in enumerate(x) if keyword in keywords]])
    return summaries

df = pd.DataFrame(text, columns = ['Text'])
df['Summary'] = key_word_intersection(df)
| | Text | Summary |
|---:|:--------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------|
| 0 | after investigation it was found that plate was fractured. It was a broken plate. patient had fractured his femur. | ['plate' 'fractured' 'broken' 'plate' 'fractured' 'femur'] |
| 1 | investigation took long. upon xray the plate, which looked ok at first suffered breakage. | ['plate' 'breakage'] |
| 2 | it happend that the screws had all broken | ['screws' 'broken'] |
| 3 | it was sad. fractured was the implant. | ['fractured' 'implant'] |
| 4 | this sentance has nothing. as does this one. and this one too. | [] |
| 5 | nothing happening here though a bone was fractured. bone was broke too as was screw. | ['bone' 'fractured' 'bone' 'broke' 'screw'] |
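For comparison, the sentence-level grouping can also be sketched with only the standard library. This is a minimal sketch rather than a drop-in replacement: the `summarize` function and the `KEYWORDS` set are hypothetical names, the sentence split is a naive regex on `.`, `!` and `?` instead of NLTK's tokenizer, and matching is lower-cased, which the code above does not do.

```python
import re

# Keyword lists copied from the question, merged into one set for fast lookups.
ITEMS = {'implant', 'implants', 'plate', 'plates', 'screw', 'screws'}
MODES = {'broke', 'broken', 'break', 'breaks', 'breakage', 'fracture', 'fractured'}
OTHER = {'bone', 'femur', 'ulna'}
KEYWORDS = ITEMS | MODES | OTHER


def summarize(text, keywords=KEYWORDS):
    """Return the keywords of each sentence, in order, as 'a b, c d'."""
    groups = []
    # Assumption: sentences end with '.', '!' or '?'. Abbreviations such as
    # "Dr." would be split incorrectly; a real sentence tokenizer handles those.
    for sentence in re.split(r'[.!?]+', text):
        # Lower-case before matching so that "Fractured" is found as well.
        words = re.findall(r"[a-z']+", sentence.lower())
        hits = [w for w in words if w in keywords]
        if hits:
            groups.append(' '.join(hits))
    return ', '.join(groups)


narrative = ('after investigation it was found that plate was fractured. '
             'It was a broken plate. patient had fractured his femur. ')
print(summarize(narrative))  # plate fractured, broken plate, fractured femur
```

Because the function works on one string at a time, `df['Text'].apply(summarize)` returns a Series with the same index as `df`, which matters if the result is later merged on the index. Sentences with a single keyword are still reported; filtering on `len(hits) >= 2` would be one way to keep only pairs.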
Upvotes: 1