Reputation: 630
I have a number of medical reports, from each of which I am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
It works except on strings missing the optional groups, on which it fails. I have tried putting a question mark after group 5, like so: (Interpretation|conclusion)?(.*), but then this group gets merged into group 4. I am pasting two contrasting strings (one containing groups 5 and 6, the other without them) for people to test their regex against. Thanks for helping.
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte
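A minimal reproduction of the problem, using shortened made-up strings in place of the two reports above (`with_tail` and `without_tail` are hypothetical stand-ins, not the real data). It only shows that the pattern as written requires group 5 and so finds nothing when that header is absent:

```python
import re

# The pattern from the question, unchanged apart from line splitting.
reportPat = re.compile(
    r'(Clinical details|indication)(.*?)(result|description|report)(.*?)'
    r'(Interpretation|conclusion)(.*)',
    re.IGNORECASE | re.DOTALL)

# Hypothetical shortened inputs standing in for text 1 and text 2.
with_tail = ("Clinical Details:\nlung cancer\nResult:\nlarge mass\n"
             "Interpretation:\nno metastases")
without_tail = "Clinical Indication:\nlung cancer\nResult:\nlarge mass"

full = reportPat.search(with_tail)
partial = reportPat.search(without_tail)

print(full.group(5))  # Interpretation
print(partial)        # None, because group 5 is mandatory in this pattern
```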
Upvotes: 2
Views: 157
Reputation: 3127
It seems that what you were missing is an end-of-pattern anchor: without it, the lazy match in group 4 can stop early once groups 5 and 6 become optional. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
The changes are adding the $ to the end and enclosing the last two groups in an optional non-capturing group, (?: ... )?. Also note how easily you can make the entire regexp more readable by splitting it across lines (Python concatenates adjacent string literals automatically at compile time).
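To see what the anchor changes, here is a small sketch on two made-up strings rather than the real reports (the names `unanchored`, `anchored`, `with_tail` and `without_tail` are only illustrative). Without `$`, the lazy group 4 is satisfied by an empty match and the optional tail is skipped. With `$`, group 4 has to grow until either the optional tail or the end of the string fits.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The pattern from this answer with the final $ left out, for comparison.
unanchored = re.compile(
    r'(Clinical details|indication)(.*)'
    r'(result|description|report)(.*?)'
    r'(?:(Interpretation|conclusion)(.*))?',
    FLAGS)

# The pattern from this answer, with the end anchor.
anchored = re.compile(unanchored.pattern + r'$', FLAGS)

# Hypothetical shortened inputs, with and without groups 5 and 6.
with_tail = "Clinical details: a\nResult: b\nInterpretation: c"
without_tail = "Clinical details: a\nResult: b"

for name, pat in (("unanchored", unanchored), ("anchored", anchored)):
    for text in (with_tail, without_tail):
        m = pat.search(text)
        # Show only the groups affected by the anchor.
        print(name, repr(m.group(4)), repr(m.group(5)), repr(m.group(6)))
```

With the unanchored pattern group 4 comes back empty for both strings. With the anchored one it holds the result text, and groups 5 and 6 are None only when the tail is really missing.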
Added: When reviewing the matches I saw some leading :\n or : \n, which can easily be cleaned up by adding (?:[:\s]*)? between the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp then looks like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
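A quick check of the cleanup on a short made-up string (`sample` is hypothetical, not one of the real reports). The leading colon and whitespace no longer appear in the text groups. Trailing newlines before the next header are still captured, so calling `.strip()` on each group may still be worthwhile.

```python
import re

# The cleaned-up pattern from this answer, unchanged.
reportPat = re.compile(
    r'(Clinical details|indication)(?:[:\s]*)?(.*)'
    r'(result|description|report)(?:[:\s]*)?(.*?)'
    r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
    re.IGNORECASE | re.DOTALL)

# Hypothetical shortened input with all six groups present.
sample = "Clinical details: a\nResult: b\nInterpretation: c"

m = reportPat.search(sample)
texts = [m.group(2), m.group(4), m.group(6)]
print(texts)                                  # text groups as captured
print([t.strip() for t in texts if t is not None])  # optional extra tidy-up
```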
Added 2: You can see the regex in action at https://regex101.com/r/gU9eV7/3. I've also added some unit test cases there to verify that it works against both texts: for text 1 the optional groups have a match, and for text 2 they have none. I used this in parallel with direct editing in a Python script to verify my answer.
Upvotes: 1
Reputation: 2549
The following pattern works for both your test cases, though given the format of the data you're having to parse I wouldn't be confident that it will work for all cases (for example, I've added : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I grouped the last 2 groups and marked them as optional using {0,1}. This means the output groups will differ a little from your original pattern: you'll have an extra group. Counting from 1, group 5 will now contain the combined output of the last two groups, and the data for those two groups will be in groups 6 and 7.
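The resulting layout can be checked directly. This is a small sketch on made-up strings (`sample_full` and `sample_short` are hypothetical). Counting from 1, as `m.group(n)` does, the wrapper is group 5 and the two inner groups are 6 and 7. These are positions 4, 5 and 6 in the zero-based `m.groups()` tuple.

```python
import re

# The pattern from this answer, split over two lines for readability.
pat = re.compile(
    r'(Clinical details|indication):(.+?)(result|description|report):(.+?)'
    r'((Interpretation|conclusion):(.+?)){0,1}\Z',
    re.IGNORECASE | re.DOTALL)

# Hypothetical shortened inputs, with and without the optional tail.
sample_full = "Clinical details: a\nResult: b\nInterpretation: c"
sample_short = "Clinical details: a\nResult: b"

m_full = pat.search(sample_full)
m_short = pat.search(sample_short)

# Print every group with its 1-based number.
for n, value in enumerate(m_full.groups(), start=1):
    print(n, repr(value))

# When the tail is absent, groups 5 to 7 are all None.
print(m_short.groups()[4:])
```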
Upvotes: 0