Reputation: 519
I have a program that grabs abbreviations (i.e., looks for words enclosed in parentheses) and then, based on the number of characters in the abbreviation, goes back that many words to find its definition. So far, it works when all of the preceding words start with capital letters, or when most of them do. In the latter case, it skips lowercase words like "in" and moves on to the next word. However, my problem is when the corresponding words are all lowercase.
Current Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
Trials (IMMPACT). Some patient prefer the usual care (UC)
Desired Output:
All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
usual care (UC)
import re
s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing"""
allabbre = []
for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    count=0
    for k,i in enumerate(words[::-1]):
        if i[0].isupper():count+=1
        if count==size:break
    words=words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern='[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
            print(abbr_keywords)
Upvotes: 1
Views: 52
Reputation: 5843
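Instead of counting capitalized words, match the first letter of each preceding word against the letters of the abbreviation, working backward from the last letter. The flag records whether the word just before the parenthesis is capitalized. If it is, first letters are compared case-sensitively, which handles rare cases like "All are Awesome Dudes (AAD)", where the lowercase "are" has to be skipped. If it is not, the first letter is uppercased before comparing, so "usual care (UC)" matches.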
import re
s = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing
"""
allabbre = []
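for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    # Index of the abbreviation letter still to be matched, starting from the last one.
    count = size - 1
    # True when the word just before the parenthesis starts with a capital letter.
    flag = words[-1][0].isupper()
    for k, i in enumerate(words[::-1]):
        # Case-sensitive comparison if flag is set; otherwise uppercase the first letter.
        first_letter = i[0] if flag else i[0].upper()
        if first_letter == abbr[count]:
            count -= 1
        if count == -1:
            break
    # Keep only the words from the first matched letter onward.
    words = words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern = '[A-Z]'
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
            print(abbr_keywords)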
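If you want to reuse this, the same backward letter-matching idea can be wrapped in a function. The following is only a sketch: the name find_definitions and the variable names are made up for illustration. It also differs from the loop above in two ways that are my own assumptions about what you might want: it skips parentheses that have no preceding words or no capital letters, and it drops an abbreviation entirely when not every letter can be matched, rather than returning all preceding words.

```python
import re


def find_definitions(text):
    """Return 'definition (ABBR)' strings for parenthesized abbreviations.

    Hypothetical helper: the name and the skip-on-failure behavior are
    assumptions for illustration, not part of the code above.
    """
    results = []
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Skip abbreviations without capitals (including empty parentheses)
        # and parentheses that have no words in front of them.
        if not words or not re.search(r"[A-Z]", abbr):
            continue
        # Compare case-sensitively only if the last word is capitalized.
        case_sensitive = words[-1][0].isupper()
        remaining = len(abbr) - 1  # index of the next letter to match
        start = None
        for k, word in enumerate(reversed(words)):
            first = word[0] if case_sensitive else word[0].upper()
            if first == abbr[remaining]:
                remaining -= 1
            if remaining < 0:
                start = len(words) - k - 1
                break
        if start is None:
            continue  # ran out of words before matching every letter
        entry = " ".join(words[start:]) + " (" + abbr + ")"
        if entry not in results:
            results.append(entry)
    return results


sample = ("Not everyone agrees that All are Awesome Dudes (AAD) "
          "or prefers the usual care (UC).")
print(find_definitions(sample))
# → ['All are Awesome Dudes (AAD)', 'usual care (UC)']
```

Returning a list instead of printing inside the loop makes it easier to check the results against the desired output.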
Upvotes: 1