Fisqkuz

Reputation: 99

Split a split (regex) in python

I have the string below and am looking for a way to split it so that I consistently end up with the following output:

'1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
['1GB 02060250396L7.067,70',
'2BE 129517720L6.633,40',
'3NL 134187650L3.824,23',
'4DE 165893440L3.111,00',
'5PL 65775644897L1.010,00',
'6DE 811506926L3.547,40',
'7AT U16235008L-830,00',
'8SE U57469158L3.001,30']

My current approach, re.split("([0-9][0-9][0-9][A-Z][A-Z])", input), also splits off my delimiter, and I see no other pattern to split on that stays consistent. Is it possible to split the delimiter as well, assigning one part of it ("70") to the preceding string and the other part ("2BE") to the following string?
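
To make the behaviour concrete, here is a small sketch on a shortened copy of the input (the name sample is only illustrative). Because the pattern is wrapped in a capturing group, re.split returns each matched delimiter as its own list element, so "70" and "2BE" end up cut off from the records they belong to:

```python
import re

# Shortened copy of the input string, used only for illustration.
sample = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,23'

# The capturing group makes re.split keep every matched delimiter
# as a separate element between the surrounding pieces.
pieces = re.split(r"([0-9][0-9][0-9][A-Z][A-Z])", sample)
print(pieces)
```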

Upvotes: 1

Views: 78

Answers (2)

Daweo

Reputation: 36390

Is it possible to split my delimiter as well and assign a part of it "70" to the string in front and a part "2BE" to the following string?

If you must use re.split at any price, you can exploit a zero-length assertion for this task in the following way:

import re
text = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'
parts = re.split(r'(?<=,[0-9][0-9])', text)
print(parts)

output

['1GB 02060250396L7.067,70', '2BE 129517720L6.633,40', '3NL 134187650L3.824,23', '4DE 165893440L3.111,00', '5PL 65775644897L1.010,00', '6DE 811506926L3.547,40', '7AT U16235008L-830,00', '8SE U57469158L3.001,30', '']

Explanation: this is a positive lookbehind. It matches the zero-length position that is preceded by a comma and two digits, so the split happens there without consuming any characters. Note that parts has a superfluous empty string at the end.
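
If the trailing empty string is unwanted, one possible way to drop it is to filter the result. This is only a sketch on a shortened input, and it assumes Python 3.7 or later, since (as far as I know) older versions do not split on patterns that match an empty string:

```python
import re

# Shortened copy of the input, used only for illustration.
text = '1GB 02060250396L7.067,702BE 129517720L6.633,40'

# Split at every position preceded by ",dd", then discard empty pieces
# (the split after the final ",40" leaves an empty string at the end).
parts = [p for p in re.split(r'(?<=,[0-9][0-9])', text) if p]
print(parts)
```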

Upvotes: 1

pho

Reputation: 25489

Use re.findall() instead of re.split().

You want to match

  • a single digit \d, followed by
  • two uppercase letters [A-Z]{2}, followed by
  • a whitespace character \s, followed by
  • one or more characters that are not a comma [^,]+, followed by
  • a comma and two digits ,\d{2}


So do:

import re

input_str = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,234DE 165893440L3.111,005PL 65775644897L1.010,006DE 811506926L3.547,407AT U16235008L-830,008SE U57469158L3.001,30'

re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", input_str)

Which gives

['1GB 02060250396L7.067,70',
 '2BE 129517720L6.633,40',
 '3NL 134187650L3.824,23',
 '4DE 165893440L3.111,00',
 '5PL 65775644897L1.010,00',
 '6DE 811506926L3.547,40',
 '7AT U16235008L-830,00',
 '8SE U57469158L3.001,30']
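
One thing to keep in mind is that re.findall() silently skips any text the pattern does not match. A quick sanity check, sketched here on a shortened input with illustrative variable names, is to confirm that the matches joined back together reproduce the original string:

```python
import re

# Shortened copy of the input, used only for illustration.
sample = '1GB 02060250396L7.067,702BE 129517720L6.633,403NL 134187650L3.824,23'

matches = re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", sample)

# If nothing was skipped, concatenating the matches gives back the input.
nothing_skipped = "".join(matches) == sample
print(matches, nothing_skipped)
```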

Alternatively, if you don't want to be so specific with your pattern, you could simply use the regex [^,]+,\d{2}

This will match one or more characters that are not a comma, then a single comma, then two digits.

re.findall(r"[^,]+,\d{2}", input_str)

# Output:
['1GB 02060250396L7.067,70',
 '2BE 129517720L6.633,40',
 '3NL 134187650L3.824,23',
 '4DE 165893440L3.111,00',
 '5PL 65775644897L1.010,00',
 '6DE 811506926L3.547,40',
 '7AT U16235008L-830,00',
 '8SE U57469158L3.001,30']
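
The two patterns give the same result on this input, but they can differ on messier data. As a hypothetical illustration (the "xx" prefix is made up and not part of the question), the specific pattern skips stray leading text, while the looser one absorbs it into the first record:

```python
import re

# Hypothetical input: two stray characters "xx" before the first record.
noisy = 'xx1GB 02060250396L7.067,702BE 129517720L6.633,40'

# The specific pattern only starts a match at digit + two letters + whitespace.
strict = re.findall(r"\d[A-Z]{2}\s[^,]+,\d{2}", noisy)

# The looser pattern starts at any non-comma character, so it keeps the "xx".
loose = re.findall(r"[^,]+,\d{2}", noisy)

print(strict)
print(loose)
```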

Upvotes: 2
