Reputation: 122148
Given a string like this:
ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",
With regex, how do I get a tuple that looks like the following:
('ORTH', ['cali.ber,kl','calf','done'])
I've been doing it as such:
txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1
But i'm getting none for re.search().group(1)
. How should it be done to get the desired output?
Upvotes: 1
Views: 349
Reputation: 32630
The reason you don't get a match is that your regex doesn't match:
r"<([A-Za-z0-9_]+)>"
is missing comma, quotation marks and the space character, which all can occur inside the < >
according to your sample.
This one would match:
re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)
What also may trip you up is the fact that the list of names is delimited by comma, which itself can be part of the values, unescaped.
That means you can't just split that string by ','
, but instead need to consider the two different quotation characters('
and "
) in order to separate the fields.
So I'd use this approach:
re.match
to split the string into PREFIX < NAMES > parts, and discard the rest.re.findall()
to split the names into fields according to quotation marks1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for .
is to match everything except newlines.
From the Python re
docs:
re.DOTALL
Make the
'.'
special character match any character at all, including a newline; without this flag,'.'
will match anything except a newline.
So you need to construct that regex with the re.DOTALL
flag. You do this by compiling it first and passing the OR
ed flags:
re.compile(pattern, flags=re.DOTALL)
2) If you include the space character before PREFIX
in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([A-Z\.]*)...
to cover both cases. The ?
is for non-greedy matching, so it matches the shortest possible match instead of the longest.
3) To cover PREFIX.FOO
just extend the pattern for the prefix to ([A-Z\.]*)
by including the .
character and escaping it.
Updated example covering all the cases you mentioned:
import re
TEST_VALUES = [
"""ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]
EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)
for value in TEST_VALUES:
prefix, names_str = pattern.match(value).groups()
names = re.findall('[\'"](.*?)["\']', names_str)
result = prefix, names
assert(result == EXPECTED)
print result
Upvotes: 2