gizgok
gizgok

Reputation: 7649

Tokenize string in python using regular expression

I've input data in the following format, not decided by me

key1: value1 key2: value2 key3: value3 key4 { key11: val11 key22: value22 } key5: value5 ............

The input string will have key values separated by colon or a brace bracket.

I want to tokenize it and I have the following idea: First to have a regular expression parsing data till I find a : or { with priority to { over :

Then split and read till the white space pattern that I said is reached and recursively traverse the whole string

I want to know if I can write a regex like (some_string)(special character pattern) (special character pattern can be : or { with precedence to {)(rest of the string)

If it is a : then for rest of the string, get the string part from ' value1 ' and capture it. Work on the remaining string

If it is a { then traverse till you find } and internally work with : logic defined above.

For eg

a: 1 b: 2 c { d: 3 e: 4 } f: 5

This should give

a:1
b:2
c { d: 3 e: 4 }
f: 5

Upvotes: 1

Views: 810

Answers (1)

Casimir et Hippolyte
Casimir et Hippolyte

Reputation: 89557

You can use this pattern:

[^ ]+(?:: [^ ]+| \{[^}]+\})

example:

import re
test = "a: 1 b: 2 c { d: 3 e: 4 } f: 5"
pattern = re.compile(r"[^ ]+(?:: [^ ]+| \{[^}]+\})")
for match in pattern.findall(test):
    print match

Upvotes: 4

Related Questions