Finding full form of parenthesized acronym using regex (easy)

Question

I am trying to find the full form of acronyms that are given in parentheses in a text. Here is an example:

Aberrant DNA methylation, one of the major epigenetic alterations in cancer, has been reported to accumulate in a subset of colorectal cancer (CRC), so-called CpG island methylator phenotype (CIMP), which was known to correlate with micro satellite reduced instability (MSI)-high CRC

Here, I want to build a list of short-form/full-form pairs like:

CRC - Colorectal Cancer

CIMP - CpG island methylator phenotype

MSI - microsatellite instability .....

The thing is, I have been able to find all parenthesized entities using re.findall(r'(\(.*?\))', s), but finding the corresponding full form is proving difficult. Assuming all such full forms are to the left of the parentheses, I can use word boundaries to find, say, the 4 words before the brackets. But to find the correct full form of the acronym, I want to use the following two conditions:

The number of words is at most 3+|SF|, where |SF| is the number of characters in the short form (in "micro satellite reduced instability (MSI)", the full form has 4 words and the short form has 3 characters).
The first word of the full form starts with the first character of the short form (e.g. colorectal cancer (CRC)).

With my current understanding of regex, I have not been able to write one that satisfies both conditions and finds all such cases in the text. Could you please give me some pointers?

Laurel · Accepted Answer

As I said before, this may be inaccurate in some cases. You will likely need to proofread the results for accuracy.

I suggest using several regexes. Here are the steps you will need to take:

Get the acronyms. You're already doing this with your first regex.
Count the letters in the acronym.
Construct and run this regex: ((?:\w+\W+){1, (acronym length +3) })\( acronym \). For example, ((?:\w+\W+){1,6})\(CRC\).
This step gets all the words within range ("not more than 3+|SF|") of the parenthesized acronym.
Construct and run this regex on the words you got in group 1 in the previous step: \b (acronym first letter) .*. For example, \bC.* for CRC. You will want to use case insensitive matching here.
This finds the first word that starts with the given letter within the range; you may catch extra words before the acronym this way.
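
The steps above can be sketched in Python roughly as follows. This is only a minimal sketch, not a definitive implementation: the function name find_acronym_expansions and the extra_words parameter are made up for illustration, and it assumes each acronym consists of word characters only and that only its first parenthesized occurrence matters.

```python
import re

def find_acronym_expansions(text, extra_words=3):
    """Return (acronym, candidate full form) pairs for parenthesized acronyms."""
    results = []
    # Step 1: collect the parenthesized acronyms (assumed to be word characters only).
    for acronym in re.findall(r'\((\w+)\)', text):
        # Step 2: the word limit is the acronym length plus extra_words.
        max_words = len(acronym) + extra_words
        # Step 3: capture up to max_words words immediately before "(ACRONYM)".
        window_re = r'((?:\w+\W+){1,%d})\(%s\)' % (max_words, re.escape(acronym))
        window = re.search(window_re, text)
        if window is None:
            continue
        # Step 4: keep everything from the first word that starts with the
        # acronym's first letter (case-insensitive). DOTALL is an addition so
        # that a full form spanning a line break is not cut off.
        start_re = r'\b%s.*' % re.escape(acronym[0])
        full = re.search(start_re, window.group(1), flags=re.IGNORECASE | re.DOTALL)
        if full:
            results.append((acronym, full.group(0).strip()))
    return results

sample = ("a subset of colorectal cancer (CRC), so-called "
          "CpG island methylator phenotype (CIMP)")
for short_form, full_form in find_acronym_expansions(sample):
    print(short_form, "-", full_form)
```

On this sample, CRC comes out as "colorectal cancer", but the candidate for CIMP starts at "CRC), so-called ..." because "CRC" is the first word in range that begins with C. That is the "extra words" case mentioned above, so the results still need proofreading.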

Note that I'm using the regex definition of "words", meaning that \w matches [a-zA-Z0-9_] (unless you're in Unicode mode, where it matches [\p{L}\p{N}_]). You may want to change \w and \W (and also \b, if the first word can come directly after a hyphen).
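
As one possible way to make that change, here is a hedged Python sketch that treats hyphenated terms as single words. The names WORD, GAP, window_before and first_word_starting_with are hypothetical, and whether a hyphen should count as part of a word is an assumption you would need to check against your own data.

```python
import re

# Assumption: a "word" may contain hyphens, so "well-characterized" is one word.
WORD = r'[\w-]+'    # replaces \w+
GAP = r'[^\w-]+'    # replaces \W+

def window_before(text, acronym, extra_words=3):
    """Return up to len(acronym) + extra_words words before "(acronym)", or None."""
    max_words = len(acronym) + extra_words
    pattern = r'((?:%s%s){1,%d})\(%s\)' % (WORD, GAP, max_words, re.escape(acronym))
    match = re.search(pattern, text)
    return match.group(1) if match else None

def first_word_starting_with(window, letter):
    """Return the window from the first word that starts with letter, or None."""
    # (?<![\w-]) replaces \b, so a match cannot begin right after a hyphen.
    pattern = r'(?<![\w-])%s.*' % re.escape(letter)
    match = re.search(pattern, window, flags=re.IGNORECASE)
    return match.group(0).strip() if match else None

sample = "a well-characterized CpG island methylator phenotype (CIMP)"
window = window_before(sample, "CIMP")
print(first_word_starting_with(window, "C"))
```

With plain \b, the search for C would start at "characterized" (the part after the hyphen); with the lookbehind it skips that and starts at "CpG".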
