Reputation: 7
I've been trying to extract three or more words after Diagnosis: or diagnosis:, to no avail.
This is the code I've been trying:
'diagnosis: \s+((?:\w+(?:\s+|$)){2})'
This prints an empty result.
I have managed to make this code work:
"Diagnosis: (\w+)",
"diagnosis: (\w+)",
which gives me the word immediately after Diagnosis: or diagnosis:.
How can I make it work for 3 or more words?
#@title Extract Diagnosis { form-width: "20%" }
def extract_Diagnosis(clinical_information):
    PATTERNS = [
        "diagnosis: (\w+).",
        "Diagnosis: (\w+).",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches if t.isalpha()])
    return Diagnosis

for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
What I'm looking for is three or more words that come after diagnosis: or Diagnosis: in 20 PDFs. I've already converted each PDF to text and extracted the paragraph that contains "diagnosis:" (the clinical information).
Upvotes: 0
Views: 105
Reputation: 23089
OK, a new answer that focuses more on the problems with your code than on problems with your regular expression. First of all, your regular expression needs to be tweaked just a little bit by removing the literal space character before \s+ and changing 2 to 3:
diagnosis:\s+((?:\w+(?:\s+|$)){3})
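Since the question asks for "3 or more" words and for both Diagnosis: and diagnosis:, here is a minimal sketch of one way to cover both. It assumes two tweaks that are not part of the pattern above: {3,} in place of {3}, and the re.IGNORECASE flag. The names DIAGNOSIS_RE and extract_diagnoses are made up for illustration.

```python
import re

# Assumed variation on the pattern above: {3,} means "three or more" words,
# and re.IGNORECASE lets one pattern match both "Diagnosis:" and "diagnosis:".
DIAGNOSIS_RE = re.compile(r"diagnosis:\s+((?:\w+(?:\s+|$)){3,})", re.IGNORECASE)

def extract_diagnoses(text):
    """Return each run of three or more words that follows 'diagnosis:'."""
    # Each match keeps the whitespace after its last word, so strip it.
    return [match.strip() for match in DIAGNOSIS_RE.findall(text)]

sample = "Diagnosis: acute viral bronchitis with fever diagnosis: mild asthma"
found = extract_diagnoses(sample)
print(found)  # ['acute viral bronchitis with fever']
```

The second "diagnosis:" in the sample is followed by only two words, so it is skipped. Note that a word followed directly by punctuation (a period, say) ends the run, because each word must be followed by whitespace or the end of the string.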
Your code has a number of issues. Here's a version of your code that kinda works, although it may not be doing exactly what you want:
import re

def extract_Diagnosis(clinical_information):
    PATTERNS = [r"diagnosis:\s+((?:\w+(?:\s+|$)){3})"]
    matches = []
    for pattern in PATTERNS:
        matches = re.findall(pattern, clinical_information)
        if len(matches) > 0:
            break
    Diagnosis = ''.join([t for t in matches])
    return Diagnosis

texts = ["diagnosis: a b c blah blah blah diagnosis: asdf asdf asdf x x x "]
for index, text in enumerate(texts):
    print(extract_Diagnosis(text))
    print("#"*79, index)
Result:
a b c asdf asdf asdf
Here are the things I fixed with your code:
I added r to the front of the string constant containing the regular expression. This specifies a "raw string" in Python. You need to either do this or double up your backslashes.
I removed the if t.isalpha() test. Given your expression, this will always be False, because what you are matching will always contain spaces as well as word characters. I see no reason for this test anyway, since you know exactly what you're getting: whatever you get has already matched your regular expression.
I hope this helps!
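As a small illustration of the raw-string point above, here is a sketch using \b (word boundary), where the difference is easy to see: in a normal Python string, "\b" is the backspace character, so the regex engine never sees a word boundary. The sample text and variable names are made up for the example.

```python
import re

# In a normal string "\b" is one character (backspace); in a raw string it
# stays as backslash + "b", which the regex engine reads as a word boundary.
assert len("\b") == 1
assert len(r"\b") == 2

text = "cat concat"
raw_hits = re.findall(r"\bcat\b", text)        # raw string: word-boundary match
plain_hits = re.findall("\bcat\b", text)       # normal string: searches for backspaces
doubled_hits = re.findall("\\bcat\\b", text)   # doubled backslashes act like the raw string

print(raw_hits)      # ['cat']
print(plain_hits)    # []
print(doubled_hits)  # ['cat']
```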
Upvotes: 2