Reputation: 335
I have some data which look like that:
PMID- 19587274
OWN - NLM
DP - 2009 Jul 8
TI - Domain general mechanisms of perceptual decision making in human cortex.
PG - 8675-87
AB - To successfully interact with objects in the environment, sensory evidence must
be continuously acquired, interpreted, and used to guide appropriate motor
responses. For example, when driving, a red
AD - Perception and Cognition Laboratory, Department of Psychology, University of
California, San Diego, La Jolla, California 92093, USA.
PMID- 19583148
OWN - NLM
DP - 2009 Jun
TI - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
amyloidosis.
PG - 482-6
AB - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
extracellular accumulation of pathologic fibrillar proteins in various tissues
AD - Asklepios Hospital, Department of Medicine, Langen, Germany.
[email protected]
I want to write a regex which can match the sentences which follow PMID, TI and AB.
Is it possible to get these in a one shot regex?
I have spent nearly the whole day to try to figure out a regex and the closest I could get is that:
reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M): print i.groupdict()
Which will return me the matches only in the second "set" of data, and not all of them.
Any idea? Thank you!
Upvotes: 0
Views: 3084
Reputation: 89171
If the order of the lines can change, you could use this pattern:
reg4 = re.compile(r"""
^
(?: PMID \s*-\s* (?P<pmid> [0-9]+ ) \n
| TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)* ) \n
| AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)* ) \n
| .+\n
)+
""", re.MULTILINE | re.VERBOSE)
It will match consecutive non-empty lines, and capture the PMID
, TI
and AB
items.
The item values can span multiple lines, as long as the lines following the first start with a whitespace character.
[^\S\n]
" matches any whitespace character (\s
), except newline (\n
)..* (?:\n[^\S\n].*)*
" matches consecutive lines that start with a whitespace character..+\n
" matches any other non-empty line.Upvotes: 0
Reputation: 45112
How about not using regexps for this task, but instead using programmatic code that splits by newlines, looks at prefix codes using .startswith() etc? The code would be longer that way but everyone would be able to understand it, without having to come to stackoverflow for help.
Upvotes: 2
Reputation: 21717
The problem were the greedy qualifiers. Here's a regex that is more specific, and non-greedy:
#!/usr/bin/python
import re
from pprint import pprint
data = open("testdata.txt").read()
reg4 = r'''
^PMID # Start matching at the string PMID
\s*?- # As little whitespace as possible up to the next '-'
\s*? # As little whitespcase as possible
(?P<pmid>[0-9]+) # Capture the field "pmid", accepting only numeric characters
.*?TI # next, match any character up to the first occurrence of 'TI'
\s*?- # as little whitespace as possible up to the next '-'
\s*? # as little whitespace as possible
(?P<title>.*?)PG # capture the field <title> accepting any character up the the next occurrence of 'PG'
.*?AB # match any character up to the following occurrence of 'AB'
\s*?- # As little whitespace as possible up to the next '-'
\s*? # As little whitespcase as possible
(?P<abstract>.*?)AD # capture the fiels <abstract> accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
print 78*"-"
pprint(i.groupdict())
Output:
------------------------------------------------------------------------------
{'abstract': ' To successfully interact with objects in the environment,
sensory evidence must\n be continuously acquired, interpreted, and
used to guide appropriate motor\n responses. For example, when
driving, a red \n',
'pmid': '19587274',
'title': ' Domain general mechanisms of perceptual decision making in
human cortex.\n'}
------------------------------------------------------------------------------
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different
diseases characterized by\n extracellular accumulation of pathologic
fibrillar proteins in various tissues\n',
'pmid': '19583148',
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients
with hepatic\n amyloidosis.\n'}
You may want to strip
the whitespace of each field after scanning.
Upvotes: -1
Reputation: 43120
Another regex:
reg4 = r'(?<=PMID- )(?P<pmid>.*?)(?=OWN - ).*?(?<=TI - )(?P<title>.*?)(?=PG - ).*?(?<=AB - )(?P<abstract>.*?)(?=AD - )'
Upvotes: 0
Reputation: 72646
How about:
import re
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI - (?P<title>.*?)^PG|AB - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
print i.groupdict()
Output:
{'pmid': '19587274', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': None}
{'pmid': '19583148', 'abstract': None, 'title': None}
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}
Edit
As a verbose RE to make it more understandable (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
(?: # Non capturing group with multiple options, first option:
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
| # Next option:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
| # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
)
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
print i.groupdict()
Note that you could replace the ^PG
and ^AD
with ^\S
to make it more general (you want to match everything up until the first non-space at the start of a line).
Edit 2
If you want to catch the whole thing in one regexp, get rid of the starting (?:
, the ending )
and change the |
characters to .*?
:
#!/usr/bin/python
import re
reg4 = re.compile(r'''
^ # Start of a line (due to re.MULTILINE, this may match at the start of any line)
PMID-\s # Literal "PMID-" followed by a space
(?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid'
.*? # Next part:
TI\s{2}-\s # "TI", two spaces, a hyphen and a space
(?P<title>.*?) # The title, a non greedy match that will capture everything up to...
^PG # The characters PG at the start of a line
.*? # Next option
AB\s{2}-\s # "AB - "
(?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
^AD # "AD" at the start of a line
''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
print i.groupdict()
This gives:
{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n be continuously acquired, interpreted, and used to guide appropriate motor\n responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'}
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n amyloidosis.\n'}
Upvotes: 2