tenebris silentio

Reputation: 519

Program to grab abbreviations and definitions - trouble getting all lower case abbreviations

I have a program that grabs abbreviations (that is, it looks for words enclosed in parentheses) and then, based on the number of characters in the abbreviation, goes back that many words to find the definition. So far, it works when all of the preceding words start with capital letters, or when most of them do. In the latter case, it skips lowercase words like "in" and moves on to the next word. However, my problem is when the corresponding words are all lowercase, as in "usual care (UC)".

Current Output:

All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
Trials (IMMPACT). Some patient prefer the usual care (UC)

Desired Output:

All Awesome Dudes (AAD)
Initiative on Methods, Measurement, and Pain Assessment in Clinical Trials (IMMPACT)
usual care (UC)

import re

s = """Too many people, but not All Awesome Dudes (AAD) only care about the 
Initiative on Methods, Measurement, and Pain Assessment in Clinical 
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of 
doing nothing"""
allabbre = []

for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    # Walk backwards, counting words that start with a capital letter
    count = 0
    for k, i in enumerate(words[::-1]):
        if i[0].isupper():
            count += 1
        if count == size:
            break
    words = words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern = '[A-Z]'

    # Only keep abbreviations that contain at least one capital letter
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
        print(abbr_keywords)

Upvotes: 1

Views: 52

Answers (1)

Smart Manoj

Reputation: 5843

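Instead of counting capitalized words, compare the first letter of each preceding word with the letters of the abbreviation, working from right to left. The flag handles rare cases like "All are Awesome Dudes (AAD)": when the word just before the parentheses is capitalized, first letters are compared case-sensitively, so a lowercase filler word such as "are" is skipped. Otherwise, as in "usual care (UC)", each first letter is upper-cased before it is compared.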

import re

s = """Too many people, but not All Awesome Dudes (AAD) only care about the 
Initiative on Methods, Measurement, and Pain Assessment in Clinical 
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of 
doing nothing
"""
allabbre = []

for match in re.finditer(r"\((.*?)\)", s):
    start_index = match.start()
    abbr = match.group(1)
    size = len(abbr)
    words = s[:start_index].split()
    # Index of the next abbreviation letter to match (right to left)
    count = size - 1
    # True if the word just before "(" is capitalized: compare case-sensitively
    flag = words[-1][0].isupper()
    for k, i in enumerate(words[::-1]):
        first_letter = i[0] if flag else i[0].upper()
        if first_letter == abbr[count]:
            count -= 1
        if count == -1:
            break
    words = words[-k-1:]
    definition = " ".join(words)
    abbr_keywords = definition + " " + "(" + abbr + ")"
    pattern = '[A-Z]'

    # Only keep abbreviations that contain at least one capital letter
    if re.search(pattern, abbr):
        if abbr_keywords not in allabbre:
            allabbre.append(abbr_keywords)
        print(abbr_keywords)
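
If you want to reuse this, the same idea can be wrapped in a function. This is only a sketch: the name find_abbreviations is made up for illustration, and two behaviours are my own assumptions rather than part of the code above. It skips a parenthetical that has no words before it, and it drops an abbreviation whose letters cannot all be matched instead of returning everything before it.

```python
import re


def find_abbreviations(text):
    """Return 'definition (ABBR)' strings in order of first appearance."""
    results = []
    for match in re.finditer(r"\((.*?)\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Skip parentheticals without a capital letter or with nothing before them
        # (assumption: the original code would raise IndexError on the latter).
        if not words or not re.search(r"[A-Z]", abbr):
            continue
        # Compare case-sensitively only when the word before "(" is capitalized.
        case_sensitive = words[-1][0].isupper()
        remaining = len(abbr) - 1  # index of the next abbreviation letter to match
        start = None
        for idx in range(len(words) - 1, -1, -1):
            first = words[idx][0] if case_sensitive else words[idx][0].upper()
            if first == abbr[remaining]:
                remaining -= 1
            if remaining < 0:
                start = idx
                break
        if start is None:
            continue  # assumption: ignore abbreviations that cannot be fully matched
        entry = " ".join(words[start:]) + " (" + abbr + ")"
        if entry not in results:
            results.append(entry)
    return results


sample = """Too many people, but not All Awesome Dudes (AAD) only care about the
Initiative on Methods, Measurement, and Pain Assessment in Clinical
Trials (IMMPACT). Some patient prefer the usual care (UC) approach of
doing nothing"""

for line in find_abbreviations(sample):
    print(line)
```

Calling find_abbreviations("All are Awesome Dudes (AAD)") exercises the flag: "are" is skipped because the comparison is case-sensitive, and the match continues back to "All".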

Upvotes: 1
