user185228

Regex for extraction in Python

I have a string like this:

"a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more".

I would like to get this as an output:

(("bla", 123, 456), ("bli", 789, 123), ("blu", 789))

I haven't been able to find the proper python regex to achieve that.

Upvotes: 0

Views: 464

Answers (7)

PaulMcG

Reputation: 63772

Is pyparsing overkill for this? Maybe, but without too much suffering, it does deliver the desired output, without a thicket of backslashes to escape the '{', '|', or '}' characters. Plus, there's no need for post-parse conversions of integers and whatnot - the parse actions take care of this kind of stuff at parse time.

from pyparsing import Word, Suppress, alphas, alphanums, nums, delimitedList

LBRACE,RBRACE,VERT = map(Suppress,"{}|")
word = Word(alphas,alphanums)
integer = Word(nums)
integer.setParseAction(lambda t: int(t[0]))

patt = (LBRACE*2 + delimitedList(word|integer, VERT) + RBRACE*2)
patt.setParseAction(lambda toks:tuple(toks.asList()))


s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

print(tuple(p[0] for p in patt.searchString(s)))

Prints:

(('bla', 123, 456), ('bli', 789, 123), ('blu', 789))

Upvotes: 0

sth

Reputation: 229844

You need a lot of escapes in your regular expression since {, } and | are special characters in them. A first step to extract the relevant parts of the string would be this:

import re
regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')
regex.findall(line)

For the example this gives:

[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

Then you can continue with converting strings with digits into integers and removing empty strings like for the last match.
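
That post-processing step could look roughly like the sketch below. The helper name clean_match is made up for illustration, and it assumes that every field made only of plain ASCII digits should become an int and that empty fields can simply be dropped.

```python
import re

regex = re.compile(r'\{\{(.*?)\|(.*?)(?:\|(.*?))?\}\}')

def clean_match(groups):
    """Hypothetical helper: drop empty fields, turn all-digit fields into ints."""
    return tuple(int(g) if g.isdigit() else g for g in groups if g)

line = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

# findall() yields one tuple of strings per {{...}} block; clean each one up.
result = tuple(clean_match(m) for m in regex.findall(line))
print(result)
```

For the example string this should print the tuple of tuples asked for in the question.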

Upvotes: 1

steveha

Reputation: 76765

We might be able to get fancy and do everything in a single complicated regular expression, but that way lies madness. Let's do one regexp that grabs the groups, and then split the groups up. We could use a regexp to split the groups, but we can just use str.split(), so let's do that.

import re
pat_group = re.compile("{{([^}]*)}}")
def mixed_tuple(iterable):
    lst = []
    for x in iterable:
        try:
            lst.append(int(x))
        except ValueError:
            lst.append(x)
    return tuple(lst)

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"

lst_groups = re.findall(pat_group, s)
lst = [mixed_tuple(x.split("|")) for x in lst_groups]

In pat_group, "{{" just matches literal "{{". "(" starts a group. "[^}]" is a character class that matches any character except for "}", and '*' allows it to match zero or more such characters. ")" closes out the group and "}}" matches literal characters. Thus, we match the "{{...}}" patterns, and can extract everything between the curly braces as a group.

re.findall() returns a list of groups matched from the pattern.

Finally, a list comprehension splits each string and returns the result as a tuple.

Upvotes: 0

Kenan Banks

Reputation: 212138

Assuming your actual format is {{[a-z]+|[0-9]+|[0-9]+}}, here's a complete program with conversion to ints.

import re

s = "a word {{bla|123|456}} another {{bli|789|123}} some more text {{blu|789}} and more"
result = []

for match in re.finditer('{{.*?}}', s):

   # Split on pipe (|) and filter out non-alphanumerics
   # join() is needed on Python 3, where filter() returns an iterator
   parts = [''.join(filter(str.isalnum, part)) for part in match.group().split('|')]

   # Convert to int when possible
   for index, part in enumerate(parts):
      try:
         parts[index] = int(part)
      except ValueError:
         pass

   result.append(tuple(parts))

Upvotes: 0

Joakim Lundborg

Reputation: 11210

To get each group as a list of strings, you need a regex and a split:

import re
[p.split("|") for p in re.findall(r"\{\{([^}]*)\}\}", s)]

To get it with the numbers converted, do this:

toint = lambda x: int(x) if x.isdigit() else x
[tuple(map(toint, p.split("|"))) for p in re.findall(r"\{\{([^}]*)\}\}", s)]

Upvotes: 0

Jeff B

Reputation: 30099

[re.split(r'\|', i) for i in re.findall(r"{{(.*?)}}", s)]

Returns:

[['bla', '123', '456'], ['bli', '789', '123'], ['blu', '789']]

This method works regardless of the number of elements in the {{ }} blocks.
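
As a rough illustration of that point, the sketch below runs the same find-then-split idea over blocks with one, two and five fields, and also converts the numeric fields. The function name extract and the sample string are invented for this example, and the conversion assumes fields of plain ASCII digits.

```python
import re

def extract(text):
    """Return one tuple per {{...}} block, converting all-digit fields to int."""
    blocks = re.findall(r"\{\{(.*?)\}\}", text)
    return [tuple(int(f) if f.isdigit() else f for f in b.split("|"))
            for b in blocks]

# Blocks of different lengths are handled by the same code path.
sample = "x {{a|1}} y {{b|2|3|4|5}} z {{c}}"
print(extract(sample))
```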

Upvotes: 0

SilentGhost

Reputation: 319929

>>> re.findall(r' {{(\w+)\|(\w+)(?:\|(\w+))?}} ', s)
[('bla', '123', '456'), ('bli', '789', '123'), ('blu', '789', '')]

If you want the numbers as integers, iterate over the output and convert each numeric string with int().

Upvotes: 1
