How do I extract a list of elements encased in quotation marks, bounded by <> and delimited by commas - Python, regex?

Question

Given a string like this:

ORTH < "cali.ber,kl", 'calf' , "done" >,
LKEYS.KEYREL.PRED "_calf_n_1_rel",

With regex, how do I get a tuple that looks like the following:

('ORTH', ['cali.ber,kl','calf','done'])

I've been doing it as such:

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1

But I'm getting None for re.search().group(1). How should this be done to get the desired output?

Lukas Graf · Accepted Answer

The reason you don't get a match is that your regex doesn't cover the characters in your data:

r"<([A-Za-z0-9_]+)>" is missing the comma, the dot, the quotation marks and the space character, all of which can occur inside the < > according to your sample.

This one would match:

re.search(r"""< ([A-Za-z0-9_.,"' ]+) >""", txt)
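To make the difference concrete, here is a small check of both patterns against the sample line from the question. The variable names narrow and wide are only for illustration, and the pattern is written as a triple-quoted raw string so that it can contain both quote characters.

```python
import re

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''

# Original pattern: no spaces, commas, dots or quotes allowed between < and >.
narrow = re.search(r"<([A-Za-z0-9_]+)>", txt)

# Widened character class that also accepts . , " ' and the space character.
wide = re.search(r"""< ([A-Za-z0-9_.,"' ]+) >""", txt)

print(narrow)         # None
print(wide.group(1))  # "cali.ber,kl", 'calf' , "done"
```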

What may also trip you up is that the list of names is delimited by commas, but a comma can itself appear, unescaped, inside a value.

That means you can't just split that string on ','; instead you need to take the two different quotation characters (' and ") into account in order to separate the fields.
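A quick illustration of the problem, using the text between < and > from the sample (names_str and naive are illustrative names):

```python
names_str = '''"cali.ber,kl", 'calf' , "done"'''

# Naive split: the comma inside the first quoted value is treated as a delimiter,
# so the first value is cut in two.
naive = [part.strip() for part in names_str.split(",")]

print(len(naive))  # 4 pieces instead of the 3 expected values
print(naive)
```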

So I'd use this approach:

Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to quotation marks
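The two steps could be sketched roughly like this for the single-line input in the question. The function name extract and the exact prefix pattern are my own assumptions, chosen to fit the sample; adjust them to your real data.

```python
import re

def extract(txt):
    """Return (prefix, [names]) for a line like: PREFIX < "a", 'b' >,"""
    # Step 1: split the line into PREFIX and the text between < and >.
    m = re.match(r'\s*([A-Z.]+) < (.*) >', txt)
    if m is None:
        return None
    prefix, names_str = m.groups()
    # Step 2: pull out everything that sits between two quote characters.
    names = re.findall(r'''["'](.*?)["']''', names_str)
    return prefix, names

txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
print(extract(txt))
```

Returning None for lines that don't have the PREFIX < ... > shape is a design choice of this sketch; raising an exception would work just as well.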

Edit:

1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for . is to match everything except newlines.

From the Python re docs:

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the flag (several flags can be combined by ORing them with |):

re.compile(pattern, flags=re.DOTALL)
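A minimal demonstration of the effect, on a made-up two-line string (text, without and with_dotall are illustrative names):

```python
import re

text = "first line\nORTH < x >"

# Without DOTALL, '.' stops at the newline, so match() (anchored at the start)
# can never reach ORTH on the second line.
without = re.match(r'.*ORTH', text)

# With DOTALL, '.' also consumes the newline and the match succeeds.
with_dotall = re.match(r'.*ORTH', text, flags=re.DOTALL)

print(without)                    # None
print(repr(with_dotall.group(0)))
```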

2) If you include the space character before PREFIX in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.
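To see why the non-greedy ? matters here, compare both variants on a made-up string that contains two PREFIX < ... > groups (the string s is an assumption for illustration only):

```python
import re

s = "AAA < one > BBB < two >"

# Greedy: .* swallows as much as it can, including the prefix itself,
# so the capture group is left with nothing.
greedy = re.match(r'.*([A-Z.]*) < ', s)

# Non-greedy: .*? takes as little as possible, so the group gets the first prefix.
lazy = re.match(r'.*?([A-Z.]*) < ', s)

print(repr(greedy.group(1)))  # ''
print(repr(lazy.group(1)))    # 'AAA'
```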

3) To cover PREFIX.FOO, just extend the pattern for the prefix to ([A-Z\.]*) by including the . character. Inside a character class the backslash is optional, but it does no harm.
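A short comparison of the prefix pattern with and without the dot, on a shortened sample line (line, without_dot and with_dot are illustrative names):

```python
import re

line = 'ORTH.FOO < "done" >,'

# Without the dot in the class, only the part after the last dot is captured,
# because the leading .*? absorbs "ORTH.".
without_dot = re.match(r'.*?([A-Z]*) < ', line).group(1)

# With the dot included, the whole dotted prefix is captured.
with_dot = re.match(r'.*?([A-Z\.]*) < ', line).group(1)

print(without_dot)  # FOO
print(with_dot)     # ORTH.FOO
```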

Updated example covering all the cases you mentioned:

import re

TEST_VALUES = [
    """ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,
LKEYS.KEYREL.PRED "_calf_n_1_rel",""",
    """calf_n1 := n_-_c_le & n_-_pn_le &
 [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,
LKEYS.KEYREL.PRED "_calf_n_1_rel","""
]

EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])


pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)


for value in TEST_VALUES:
    prefix, names_str = pattern.match(value).groups()
    names = re.findall('[\'"](.*?)["\']', names_str)

    result = prefix, names
    assert result == EXPECTED

print(result)
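One caveat with the findall pattern: it accepts either quote character as the closing quote, so a value that contains the other kind of quote gets cut short. Your sample has no such value, so this is only an assumption about what real data might look like. If it does occur, a backreference can require the closing quote to be the same character as the opening one:

```python
import re

# Hypothetical input: the first value contains an apostrophe.
names_str = '''"it's", 'calf' , "done"'''

# Loose version: any quote character closes the value.
loose = re.findall(r'''["'](.*?)["']''', names_str)

# Strict version: \1 must repeat whichever quote character opened the value.
strict = [m.group(2) for m in re.finditer(r'''(["'])(.*?)\1''', names_str)]

print(loose)
print(strict)  # ["it's", 'calf', 'done']
```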
