Multiple captures within a string

Question

I haven't found a Q&A on SO that quite covers this situation, though I've borrowed from several to get as far as I have.

I'm parsing the header (metadata) part of VCF files. Each line has the format:

##TAG=<key=val,key2=val2>

I have a regex that parses the multiple k-v pairs inside the <>, but I can't seem to add in the <> and have it still "work."

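import re

s = 'a=1,b=two,c="three"'

# One key=value pair per match; findall returns a list of (key, value) tuples.
pat = re.compile(r'''(?P<key>\w+)=(?P<value>[^,]*),?''')
match = pat.findall(s)
print(dict(match))
#{'a': '1', 'b': 'two', 'c': '"three"'}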

Also,

s = 'a=1,b=two,c="three"'

pat = re.compile(r'''(?:(?P<key>\w+)=(?P<value>[^,]*),?)''')
match = pat.findall(s)
print(match)
print(dict(match))
#[('a', '1'), ('b', 'two'), ('c', '"three"')]
#{'a': '1', 'b': 'two', 'c': '"three"'}

So, I thought I might be able to do:

s = '<a=1,b=two,c="three">'

pat = re.compile(r'''<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>''')
match = pat.findall(s)
print(match)
print(dict(match))
#[]
#{}

If possible, I'd really like to do something like:

\#\#(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,]*),?)>

and capture the TAG and all of the k-v pairs. And obviously, I'd like it to "work."

I realize the "proper" solution here is likely to use a parser rather than regex. But I'm a bioinformatics person, not a programmer. And the format is very consistent and laid out in a standardized specification that is (almost) always followed.

Ryszard Czech · Accepted Answer

With the PyPI regex module:

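import regex
s = '##TAG=<key=val,key2=val2>'
pat = regex.compile(r'''##(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,<>]*),?)*>''')
match = pat.search(s)
# captures() returns every value a repeated group matched, not just the last one
print([match.group("tag"), list(zip(match.captures("key"), match.captures("value")))])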
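For context on why the third-party module helps here: with the standard library's re, a group that sits inside a repeated group keeps only its last match, so there is nothing like captures() to list every repetition. A minimal sketch of that behaviour, assuming the same shape of pattern and the sample line `##TAG=<key=val,key2=val2>`:

```python
import re

line = '##TAG=<key=val,key2=val2>'
pat = re.compile(r'##(?P<tag>\w+)=<(?:(?P<key>\w+)=(?P<value>[^,<>]*),?)*>')
m = pat.search(line)

# The whole line matches, but each named group inside the repeated
# (?:...)* only remembers its final repetition.
last_key = m.group("key")
last_value = m.group("value")
print(m.group("tag"), last_key, last_value)
# TAG key2 val2
```

The first pair (key, val) took part in the match but cannot be recovered from the match object afterwards.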

Regex explanation:

--------------------------------------------------------------------------------
  ##                       '##'
--------------------------------------------------------------------------------
  (?P<tag>                 group and capture to \k<tag>:
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \k<tag>
--------------------------------------------------------------------------------
  =<                       '=<'
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    (?P<key>                 group and capture to \k<key>:
--------------------------------------------------------------------------------
      \w+                      word characters (a-z, A-Z, 0-9, _) (1
                               or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \k<key>
--------------------------------------------------------------------------------
    =                        '='
--------------------------------------------------------------------------------
    (?P<value>               group and capture to \k<value>:
--------------------------------------------------------------------------------
      [^,<>]*                  any character except: ',', '<', '>' (0
                               or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
    )                        end of \k<value>
--------------------------------------------------------------------------------
    ,?                       ',' (optional (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  >                        '>'

Results: ['TAG', [('key', 'val'), ('key2', 'val2')]]
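If installing a third-party module is not an option, a two-step approach with the standard library's re is one possible alternative: first match the tag and the raw text between the angle brackets, then run findall over that text. This is only a sketch under assumptions, not part of the answer above; the names LINE_RE, PAIR_RE and parse_header_line are made up for illustration.

```python
import re

# Outer pattern: the tag name plus the raw text between the angle brackets.
LINE_RE = re.compile(r'^##(?P<tag>\w+)=<(?P<body>[^<>]*)>$')
# Inner pattern: a single key=value pair, applied repeatedly via findall.
PAIR_RE = re.compile(r'(?P<key>\w+)=(?P<value>[^,]*)')

def parse_header_line(line):
    """Return (tag, {key: value}), or None if the line lacks the ##TAG=<...> shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # findall yields (key, value) tuples because the pattern has two groups.
    return m.group("tag"), dict(PAIR_RE.findall(m.group("body")))

print(parse_header_line('##TAG=<key=val,key2=val2>'))
# ('TAG', {'key': 'val', 'key2': 'val2'})
```

Like the patterns in the question, this splits values on every comma, so a quoted value that itself contains a comma (as a free-text description might) would be cut short. Lines without angle brackets simply return None.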
