Dulanic

Reputation: 87

Splitting string into groups with regex?

I have strings that can contain a varying number of "groups". I need to split them, but I am having trouble doing so. Each group always starts with [A-Z]{2,5} followed by a : and a string of varying length that may contain spaces. There is always a space in front of each group.

Example strings:

"YellowSky AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"
"Billy Bob Thorton AA:213231 AB:aaaa AC:ddddd 322 AD:hj2ffs   dsfdsfd1jkhjk23"

My code thus far:

import re
D = "Test1 AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"
    
g = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(D)

This works when the string starts with a single word (as in "Test1 ..."), but it fails when the leading part contains several words separated by spaces (as in "Billy Bob Thorton ...").

Upvotes: 4

Views: 2314

Answers (2)

vks

Reputation: 67968

([A-Z]{2,5}:\w+(?: +\w+)*)(?=(?: +[A-Z]+:|$))

You can also use re.findall directly.

See demo.

https://regex101.com/r/6jf8EM/1

This way you don't need to filter unwanted groups later. You get what you need.
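As a rough sketch of the re.findall approach, assuming the sample strings from the question (the names GROUP_RE and find_groups are made up for illustration):

```python
import re

# Pattern from this answer: a 2-5 letter uppercase tag, a colon, then one or
# more words separated by spaces, up to the next "TAG:" or the end of string.
GROUP_RE = re.compile(r"([A-Z]{2,5}:\w+(?: +\w+)*)(?=(?: +[A-Z]+:|$))")

def find_groups(text):
    """Return the TAG:value groups found in text (hypothetical helper)."""
    return GROUP_RE.findall(text)

print(find_groups("YellowSky AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"))
# ['AA:Hello', 'AB:1234', 'AC:1F 322', 'AD:hj21jkhjk23']
```

Note that the text in front of the first group ("YellowSky" here) is not returned, since only the TAG:value parts are matched.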

Upvotes: 1

Wiktor Stribiżew

Reputation: 626690

You can use

re.split(r'(?!^)\s+(?=[A-Z]+:)', text)

See this regex demo.

Details:

  • (?!^) - a negative lookahead that matches a location not at the start of string (equal to (?<!^) but one char shorter)
  • \s+ - one or more whitespaces
  • (?=[A-Z]+:) - a positive lookahead that requires one or more uppercase ASCII letters followed by a : char immediately to the right of the current location.
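A small sketch of how the split behaves on both example strings from the question (split_groups is just an illustrative wrapper name, not part of the re module):

```python
import re

def split_groups(text):
    """Split text at whitespace that precedes an uppercase TAG: prefix."""
    # (?!^)        - do not split at the very start of the string
    # \s+          - the whitespace that gets consumed by the split
    # (?=[A-Z]+:)  - only where an uppercase tag plus colon follows
    return re.split(r"(?!^)\s+(?=[A-Z]+:)", text)

print(split_groups("YellowSky AA:Hello AB:1234 AC:1F 322 AD:hj21jkhjk23"))
# ['YellowSky', 'AA:Hello', 'AB:1234', 'AC:1F 322', 'AD:hj21jkhjk23']

print(split_groups("Billy Bob Thorton AA:213231 AB:aaaa AC:ddddd 322 AD:hj2ffs   dsfdsfd1jkhjk23"))
# ['Billy Bob Thorton', 'AA:213231', 'AB:aaaa', 'AC:ddddd 322', 'AD:hj2ffs   dsfdsfd1jkhjk23']
```

The multi-word prefix stays intact because " Bob" and " Thorton" are not followed by uppercase letters ending in a colon.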

Upvotes: 2
