Reputation: 538
I am trying to find the full forms of acronyms that are given in parentheses in a text. Here is an example -
Aberrant DNA methylation, one of the major epigenetic alterations in cancer, has been reported to accumulate in a subset of colorectal cancer (CRC), so-called CpG island methylator phenotype (CIMP), which was known to correlate with microsatellite instability (MSI)-high CRC
Here, I want to be able to form a list of short-form/full-form occurrences like -
CRC - Colorectal Cancer
CIMP - CpG island methylator phenotype
MSI - microsatellite instability .....
The thing is, I have been able to find all parenthesized entities using re.findall(r'(\(.*?\))', s), but finding the corresponding full form is proving difficult. Assuming all such full forms are to the left of the parentheses, I can use word boundaries to find, say, the 4 words before the brackets. But to find the correct full form of the acronym, I want to use the following two conditions -
With my current understanding of regex, I have not been able to write a regex that satisfies the above 2 conditions and finds all such cases in the text. Could you please give me some pointers for this?
Upvotes: 1
Views: 292
Reputation: 6173
As I said before, this may be inaccurate in some cases. You will likely need to proofread the results for accuracy.
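I suggest using several regexes. Here are the steps you will need to take:
1. For each acronym, capture the words immediately before it, up to (acronym length + 3) of them, with a regex of the form ((?:\w+\W+){1,N})\(ACRONYM\), where N stands for the acronym length plus 3 and ACRONYM for the acronym itself. For example, ((?:\w+\W+){1,6})\(CRC\).
2. Within the captured text, match from the first word that starts with the acronym's first letter through to the end, using \bX.*, where X stands for that letter. For example, \bC.* for CRC. You will want to use case-insensitive matching here.
Note that I'm using the regex definition of "words", meaning that \w matches [a-zA-Z0-9_] (unless you're in Unicode mode, when it matches [\p{L}\p{N}_]). You may want to change \w and \W (and also \b, if the first word can come directly after a hyphen).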
Upvotes: 1