Reputation: 10383
I haven't found a Q&A on SO that quite covers this situation, though I've used solutions from a few of them to get as far as I have.
I'm parsing the header (metadata) part of VCF files. Each line has the format:
##TAG=<key=val,key=val,...>
I have a regex that parses the multiple k-v pairs inside the <>, but I can't seem to add in the <> and have it still "work."
import re
s = 'a=1,b=two,c="three"'
pat = re.compile(r'''(?P<key>\w+)=(?P<value>[^,]*),?''')
match = pat.findall(s)
print(dict(match))
#{'a': '1', 'b': 'two', 'c': '"three"'}
Also,
s = 'a=1,b=two,c="three"'
pat = re.compile(r'''(?:(?P<key>\w+)=(?P<value>[^,]*),?)''')
match = pat.findall(s)
print(match)
print(dict(match))
#[('a', '1'), ('b', 'two'), ('c', '"three"')]
#{'a': '1', 'b': 'two', 'c': '"three"'}
So, I thought I might be able to do:
s = '<a=1,b=two,c="three">'
pat = re.compile(r'''<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>''')
match = pat.findall(s)
print(match)
print(dict(match))
#[]
#{}
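A likely reason for the empty result, sketched with the standard-library re module: the pattern asks for '<', exactly one key=value pair, and then '>', so it can only match a header that holds a single pair. Adding * to the group lets the whole string match, but re keeps only the last capture of a repeated group. The variable names below are illustrative.

```python
import re

s = '<a=1,b=two,c="three">'

# Original attempt: the group is not repeated, so only '<one_pair>' can match.
one_pair = re.compile(r'<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>')
single = one_pair.findall('<a=1>')
failing = one_pair.findall(s)
print(single)    # a header with a single pair does match
print(failing)   # three pairs between the brackets: no match

# Repeating the group with * matches the whole string, but the standard
# re module only remembers the final iteration of each repeated group.
many_pairs = re.compile(r'<(?:(?P<key>\w+)=(?P<value>[^,]*),?)*>')
repeated = many_pairs.findall(s)
print(repeated)  # only the last key-value pair survives
```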
If possible, I'd really like to do something like:
\#\#(?P<tag>)=<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>
and capture the TAG and all of the k-v pairs. And obviously, I'd like it to "work."
I realize the "proper" solution here is likely to use a parser rather than regex. But I'm a bioinformatics person, not a programmer. And the format is very consistent and laid out in a standardized specification that is (almost) always followed.
Upvotes: 2
Views: 57
Reputation: 18611
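With the PyPI regex module (pip install regex), which keeps every capture of a repeated group and exposes them through captures():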
import regex
s = '##TAG=<key=val,key2=val2>'
pat = regex.compile(r'''##(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,<>]*),?)*>''')
match = pat.search(s)
print([match.group("tag"), list(zip(match.captures("key"), match.captures("value")))])
Regex explanation:
##               '##'
(?P<tag>         group and capture to \k<tag>:
  \w+            word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
)                end of \k<tag>
=<               '=<'
(?:              group, but do not capture (0 or more times (matching the most amount possible)):
  (?P<key>       group and capture to \k<key>:
    \w+          word characters (a-z, A-Z, 0-9, _) (1 or more times (matching the most amount possible))
  )              end of \k<key>
  =              '='
  (?P<value>     group and capture to \k<value>:
    [^,<>]*      any character except: ',', '<', '>' (0 or more times (matching the most amount possible))
  )              end of \k<value>
  ,?             ',' (optional (matching the most amount possible))
)*               end of grouping
>                '>'
Results: ['TAG', [('key', 'val'), ('key2', 'val2')]]
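If installing a third-party module is not an option, a two-step approach with the standard library is one alternative (a sketch, not part of the answer above): match the tag and the raw text between the angle brackets first, then run the question's key-value pattern over that text. The names HEADER_RE, PAIR_RE, and parse_header are hypothetical.

```python
import re

# Step 1: capture the tag and everything between '<' and '>'.
HEADER_RE = re.compile(r'##(?P<tag>\w+)=<(?P<body>[^<>]*)>')
# Step 2: the question's key-value pattern, applied to the body only.
PAIR_RE = re.compile(r'(?P<key>\w+)=(?P<value>[^,]*),?')

def parse_header(line):
    """Return (tag, {key: value}) for a '##TAG=<k=v,...>' line, else None."""
    m = HEADER_RE.fullmatch(line.strip())
    if m is None:
        return None
    return m.group('tag'), dict(PAIR_RE.findall(m.group('body')))

result = parse_header('##TAG=<a=1,b=two,c="three">')
print(result)
```

This assumes values never contain commas or angle brackets; a quoted value with an embedded comma would be cut at the comma and would need a more careful value pattern.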
Upvotes: 1