Reputation: 13
I have a string like this:
base | text1: 0.01 | text2: 0.02 | text3: 0.03
I need to extract the first word and all of the following text-number pairs. This is the result I expect:
("base", "text1", "0.01", "text2", "0.02", "text3", "0.03")
I am trying this regexp:
r"^(\w+)(?:\s+\|\s+)(?:([\w\s]*)\:\s([0-9.]+)(?:\s+\|\s+)?)+$"
But it captures only the last text-number pair:
("base", "text3", "0.03")
Here is the full code I use:
import re
sr = "base | text1: 0.01 | text2: 0.02 | text3: 100.1"
pattern = r"^(\w+)(?:\s+\|\s+)(?:([\w\s]*)\:\s([0-9.]+)(?:\s+\|\s+)?)+$"
# re.match returns a Match object, which has .groups(); re.findall returns a list
result = re.match(pattern, sr)
print(result.groups())
Thank you!
Upvotes: 1
Views: 122
Reputation: 4564
I suggest something like this:
import re
sr = "base | text1: 0.01 | text2: 0.02 | text3: 100.1"
pattern1 = r"^(\w+)((?:\s+\|\s+[\w\s]+\s*:\s*\d+\.\d+)+)$"
bases = re.findall(pattern1, sr)
for base in bases:
    result = [base[0]]
    pattern2 = r"\|\s+([\w\s]+)\s*:\s*(\d+\.\d+)"
    texts = re.findall(pattern2, base[1])
    for text in texts:
        result.append(text[0])
        result.append(text[1])
    print(result)
Note the simplified regular expressions.
Upvotes: 0
Reputation: 163632
One option to get the desired result is to split on either " | " (space, pipe, space) or ": " (colon, space):
(?: \| |: )
Example code
import re
s="base | text1: 0.01 | text2: 0.02 | text3: 0.03"
print(re.split(r"(?: \| |: )", s))
Output
['base', 'text1', '0.01', 'text2', '0.02', 'text3', '0.03']
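The pattern in the question uses \s+ around the pipe, so the spacing may not always be exactly one space. If that is the case, a whitespace-tolerant variant of the split could look like the sketch below. The function name split_fields and the sample input are made up for illustration, and this assumes that neither the text nor the numbers contain a pipe or a colon themselves.

```python
import re

def split_fields(s):
    # Split on a pipe or a colon and swallow any whitespace around it.
    return re.split(r"\s*[|:]\s*", s.strip())

print(split_fields("base |text1:0.01  |  text2 : 0.02"))
# ['base', 'text1', '0.01', 'text2', '0.02']
```

Like the plain split, this does not validate the input; it only cuts the string at the separators.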
Another option is to use the regex module from PyPI, which supports the \G anchor, together with capturing groups: the first word is in group 1, and each pair is in groups 2 and 3.
(?:^(\w+)|\G(?!^))\s+\|\s+(\w+):\s+(\d+\.\d+)
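The standard library re module does not support \G. A rough stdlib approximation of the same idea is to call Pattern.match(s, pos) in a loop, since that only matches starting exactly at pos, so each pair has to begin where the previous match ended. This is only a sketch: the function name extract is hypothetical, and unlike the \G pattern it returns None on malformed input and accepts a first word with no pairs after it.

```python
import re

HEAD = re.compile(r"(\w+)")
PAIR = re.compile(r"\s+\|\s+(\w+):\s+(\d+\.\d+)")

def extract(s):
    """Return [first_word, text1, number1, ...] or None if s is malformed."""
    m = HEAD.match(s)
    if m is None:
        return None
    result = [m.group(1)]
    pos = m.end()
    # Each pair must start exactly where the previous match ended,
    # which is the role \G plays in the regex-module pattern.
    while pos < len(s):
        m = PAIR.match(s, pos)
        if m is None:
            return None  # unexpected text between or after the pairs
        result.extend(m.groups())
        pos = m.end()
    return result

print(extract("base | text1: 0.01 | text2: 0.02 | text3: 0.03"))
# ['base', 'text1', '0.01', 'text2', '0.02', 'text3', '0.03']
```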
Upvotes: 2