user1995

Reputation: 538

Finding full form of parenthesized acronym using regex (easy)

I am trying to find the full forms of acronyms that are given in parentheses in a text. Here is an example:

Aberrant DNA methylation, one of the major epigenetic alterations in cancer, has been reported to accumulate in a subset of colorectal cancer (CRC), so-called CpG island methylator phenotype (CIMP), which was known to correlate with micro satellite reduced instability (MSI)-high CRC

From this, I want to build a list of short-form/full-form pairs such as:

CRC - Colorectal Cancer

CIMP - CpG island methylator phenotype

MSI - microsatellite instability .....

I have been able to find all parenthesized entities using re.findall('(\(.*?\))', s), but finding the corresponding full form is proving difficult. Assuming every full form sits to the left of its parentheses, I could use word boundaries to grab, say, the 4 words before the brackets. To find the correct full form of the acronym, however, I want to apply the following two conditions:

With my current understanding of regex, I have not been able to write a pattern that satisfies the two conditions above and finds all such cases in the text. Could you please give me some pointers?

Upvotes: 1

Views: 292

Answers (1)

Laurel

Reputation: 6173

As I said before, this may be inaccurate in some cases. You will likely need to proofread the results for accuracy.

I suggest using several regexes. Here are the steps you will need to take:

  1. Get the acronyms. You're already doing this with your first regex.
  2. Count the letters in the acronym.
  3. Construct and run this regex: ((?:\w+\W+){1, (acronym length + 3) })\( acronym \). The spaces are only there for readability; the real pattern should contain none. For example, ((?:\w+\W+){1,6})\(CRC\).
    This step captures the words within range ("not more than 3+|SF|") immediately before the parenthesized acronym.
  4. Construct this regex and run it on the words captured in group 1 of the previous step: \b (acronym first letter) .*. For example, \bC.* for CRC. You will want case-insensitive matching here.
    This matches from the first word in the range that starts with the given letter through to the end of the range. If an earlier word in the range happens to start with the same letter, the match begins too early and you get extra words in front of the real full form.
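The four steps above can be sketched in Python roughly as follows. This is only an illustration, not a tested solution: the function name find_full_forms and the sample string are made up for the example, and step 1 is narrowed to parentheses that contain only letters and digits, which is an assumption on my part.

```python
import re

def find_full_forms(text):
    """Return (acronym, candidate full form) pairs using the four steps."""
    results = []
    # Step 1: collect parenthesized acronyms (letters/digits only: an assumption).
    for acronym in re.findall(r"\(([A-Za-z0-9]+)\)", text):
        # Step 2: the word range is the acronym length plus 3.
        max_words = len(acronym) + 3
        # Step 3: capture up to max_words words directly before "(ACRONYM)".
        window_re = r"((?:\w+\W+){1,%d})\(%s\)" % (max_words, re.escape(acronym))
        window = re.search(window_re, text)
        if window is None:
            continue
        # Step 4: keep everything from the first word that starts with the
        # acronym's first letter, ignoring case.
        start_re = r"\b%s.*" % re.escape(acronym[0])
        candidate = re.search(start_re, window.group(1), re.IGNORECASE | re.DOTALL)
        if candidate:
            results.append((acronym, candidate.group(0).strip()))
    return results

sample = ("a subset of colorectal cancer (CRC), so-called "
          "CpG island methylator phenotype (CIMP)")
pairs = find_full_forms(sample)
# pairs[0] == ("CRC", "colorectal cancer")
# pairs[1] == ("CIMP", "CRC), so-called CpG island methylator phenotype")
```

The CIMP result shows the inaccuracy mentioned earlier: the 7-word range reaches back to the preceding "CRC", which also starts with "C", so the candidate begins too early and needs proofreading.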

Note that I'm using the regex definition of a "word": \w matches [a-zA-Z0-9_] (or [\p{L}\p{N}_] if you're in Unicode mode). You may want to replace \w and \W with something else (and also \b, if the first word can come directly after a hyphen).
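To make the \w and \b caveats concrete, here is a small Python illustration. In Python 3, \w is Unicode-aware by default for str patterns and re.ASCII restricts it to [a-zA-Z0-9_]. The lookbehind at the end is just one possible stand-in for \b, not something the steps above require, and the sample strings are invented.

```python
import re

# \w in Python 3: Unicode-aware by default, ASCII-only with re.ASCII.
unicode_words = re.findall(r"\w+", "naïve-Bayes")          # ['naïve', 'Bayes']
ascii_words = re.findall(r"\w+", "naïve-Bayes", re.ASCII)  # ['na', 've', 'Bayes']

# \b also matches right after a hyphen, so "called" in "so-called" qualifies
# as a word starting with "c".
after_hyphen = re.search(r"\bc\w*", "so-called CpG", re.IGNORECASE).group(0)

# Possible replacement for \b: require that the previous character is
# neither a word character nor a hyphen.
stricter = re.search(r"(?<![\w-])c\w*", "so-called CpG", re.IGNORECASE).group(0)
# after_hyphen == 'called', stricter == 'CpG'
```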

Upvotes: 1
