Reputation: 3086
I am working on information extraction from medical texts (very new to NLP!). At the moment I am interested in finding and extracting the medications that appear in a predefined list of drugs. For example, consider the text:
"John was prescribed aspirin due to high temperature"
Given the list of medications (in Python):
list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']
The extracted drug is aspirin. That's fine.
Now consider another case:
"John was prescribed ibuprofen, because he could not tolerate paracetamol"
Now, if I extract the drugs using the list (for example, with a regular expression), the extracted drugs are ibuprofen and paracetamol.
QUESTION: How can I separate the drugs that were actually prescribed from the ones that were not tolerated? Is there a way to label prescribed (used) drugs separately from other mentioned drugs?
Upvotes: 0
Views: 2332
Reputation: 1882
This is a complex problem. To capture the nuances around negation, you need to step into the world of dependency parsing and relationship extraction. A couple of paths you can take to add sophistication to your current approach and the add-on by @Jordan are:
Handling negation in relations is not a solved problem. The state of the art here is usually associated with sentiment analysis. An introduction to using dependency parsing to identify and handle negation is available on the Stanford NLP "Sentiment Analysis using RNN" page.
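Short of a full dependency parser, the idea of negation scope can be roughly approximated with a clause-scoped cue-word heuristic. The sketch below is not dependency parsing: it splits the sentence at punctuation and a few conjunctions, then treats any drug that follows a negation cue within the same clause as negated. The cue list, the clause-boundary pattern, and the function name label_drugs_by_clause are all assumptions made for illustration, and real clinical text will break this quickly.

```python
import re

# Hypothetical cue list and clause splitter; both are illustrative assumptions.
NEGATION_CUES = {"not", "no", "never", "without", "cannot"}
CLAUSE_BOUNDARY = re.compile(r"[,;.]|\b(?:because|but|although)\b", re.IGNORECASE)

def label_drugs_by_clause(text, meds):
    """Map each mentioned drug to 'negated' or 'affirmed'."""
    labels = {}
    for clause in CLAUSE_BOUNDARY.split(text):
        tokens = re.findall(r"[a-z']+", clause.lower())
        negated = False
        for token in tokens:
            if token in NEGATION_CUES or token.endswith("n't"):
                negated = True  # the cue covers the rest of this clause
            elif token in meds:
                labels[token] = "negated" if negated else "affirmed"
    return labels

list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']
print(label_drugs_by_clause(
    "John was prescribed ibuprofen, because he could not tolerate paracetamol",
    list_of_meds))
# prints {'ibuprofen': 'affirmed', 'paracetamol': 'negated'}
```

A dependency parser would replace the crude clause split with the actual syntactic link between the negation word and the verb governing the drug, which is what makes it more robust on longer sentences.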
Upvotes: 3
Reputation: 442
One way to overcome this is to predefine which words come before the medicine name. In your case, this means checking whether either "prescribed" or "not tolerate" comes before the medicine name.
This is what I have come up with. Just replace text = first with text = second if you want to try the second piece of text.
import string

list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']
first = "John was prescribed aspirin due to high temperature"
second = "John was prescribed ibuprofen, because he could not tolerate paracetamol"

text = first
# Strip punctuation so that "ibuprofen," still matches "ibuprofen"
for c in string.punctuation:
    text = text.replace(c, "")
words = text.split()

medicine = None  # stays None if no prescribed drug is found
for idx, word in enumerate(words):
    # Accept a drug only when the word right before it is "prescribed"
    if word in list_of_meds and idx > 0 and words[idx - 1] == "prescribed":
        medicine = word
        break
Good luck!
Jordan.
----- EDIT -----
Use the variable medicine as the output; you can work with that variable from there.
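The same idea can be stretched to label every drug in the sentence rather than stopping at the first prescribed one. The following is only a sketch under the assumption that the cue always sits directly before the drug name; the function name label_by_preceding_words and the label strings are made up for illustration.

```python
import string

list_of_meds = ['aspirin', 'ibuprofen', 'paracetamol']

def label_by_preceding_words(text, meds):
    """Label each drug by the word(s) immediately before it."""
    # Remove punctuation and lowercase so tokens compare cleanly
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    words = cleaned.lower().split()
    labels = {}
    for idx, word in enumerate(words):
        if word not in meds:
            continue
        before = words[max(0, idx - 2):idx]  # up to two preceding words
        if before[-2:] == ["not", "tolerate"]:
            labels[word] = "not tolerated"
        elif before[-1:] == ["prescribed"]:
            labels[word] = "prescribed"
        else:
            labels[word] = "mentioned"
    return labels

second = "John was prescribed ibuprofen, because he could not tolerate paracetamol"
print(label_by_preceding_words(second, list_of_meds))
# prints {'ibuprofen': 'prescribed', 'paracetamol': 'not tolerated'}
```

Any wording that puts other words between the cue and the drug (for example "prescribed a low dose of aspirin") falls through to "mentioned", which is the main limitation of this fixed-window check.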
Upvotes: 2