Franky Ray

Reputation: 143

Python re.split does not return the parts I need

I have this in my file:

import re

sample = """Name: @s
Owner: @a[tag=Admin]"""

target = r"@[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)

print(regex)

I want to split the string at every token that starts with @ and keep those tokens, like this:
["Name: ", "@s", "\nOwner: ", "@a[tag=Admin]"]

But instead it gives this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']

How can I separate it that way?

Upvotes: 3

Views: 549

Answers (4)

The fourth bird

Reputation: 163372

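In your output, only [tag=Admin] is kept because that part is the only capture group, and re.split returns just the captured parts of each delimiter match. For @s the group does not take part in the match, hence the None. Using split can also return empty strings, such as the trailing '' here.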

Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.

(\s*\w+:\s*)(@[sae](?:\[[\w{}=, ]*])?)

The pattern matches:

  • ( Capture group 1
    • \s*\w+:\s* Match 1+ word characters followed by :, with optional whitespace chars before and after
  • ) Close group
  • ( Capture group 2
    • @[sae] Match @ followed by either s a e
    • (?:\[[\w{}=, ]*])? Optionally match [...]
  • ) Close group

Example code:

import re

sample = """Name: @s
Owner: @a[tag=Admin]"""
target = r"(\s*\w+:\s*)(@[sae](?:\[[\w{}=, ]*])?)"

listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst) 

Output

['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']
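
As a small sketch, the same pattern can also be written with re.VERBOSE so the breakdown above sits next to the regex itself. Using finditer and m.groups() is just one possible way to flatten the result, and the variable names are only illustrative.

```python
import re

sample = """Name: @s
Owner: @a[tag=Admin]"""

# Same pattern as above, spread out with re.VERBOSE.
# Whitespace inside a character class is still significant in verbose mode.
target = re.compile(r"""
    (\s*\w+:\s*)             # group 1: a label such as 'Name: '
    (                        # group 2: the selector
        @[sae]               #   @ followed by s, a or e
        (?:\[[\w{}=, ]*])?   #   optional bracketed argument list
    )
""", re.VERBOSE)

# Flatten the two groups of every match into one list.
parts = [g for m in target.finditer(sample) for g in m.groups()]
print(parts)  # ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']
```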

See a regex demo and a Python demo.

Upvotes: 0

rici

Reputation: 241821

re.split expects the regular expression to match the delimiters in the string. Besides the text between delimiters, it only returns the parts of each delimiter that are captured by a group. In the case of your regex, that's only the part between the brackets, if present.

If you want the whole delimiter to show up in the list, put parentheses around the whole regex:

target = r"(@[sae](\[[\w{}=, ]*\])?)"

But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):

target = r"(@[sae](?:\[[\w{}=, ]*\])?)"
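
To make the difference concrete, here is a minimal sketch that runs both patterns on the sample from the question. Dropping the empty string at the end is my own addition, on the assumption that you do not want it; it appears because the sample ends with a delimiter.

```python
import re

sample = """Name: @s
Owner: @a[tag=Admin]"""

# Inner group capturing: the bracket part is reported a second time (or as None).
both_groups = re.split(r"(@[sae](\[[\w{}=, ]*\])?)", sample)
print(both_groups)
# ['Name: ', '@s', None, '\nOwner: ', '@a[tag=Admin]', '[tag=Admin]', '']

# Inner group non-capturing: only the whole delimiter is reported.
outer_only = re.split(r"(@[sae](?:\[[\w{}=, ]*\])?)", sample)
print(outer_only)
# ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]', '']

# Drop the trailing '' left over after the last delimiter.
parts = [p for p in outer_only if p]
print(parts)
# ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']
```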

Upvotes: 0

Cary Swoveland

Reputation: 110685

If I understand the requirements correctly you could do that as follows:

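import re
s = """Name: @s
Owner: @a[tag=Admin]
"""
rgx = r'(?=@.*)|(?=\r?\n[^@\r\n]*)'
# Splitting on zero-width matches requires Python 3.7+.
print(re.split(rgx, s))
# ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]', '\n']
# The trailing newline in s is split off as its own final element.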

Demo

The regular expression can be broken down as follows.

(?=         # begin a positive lookahead
  @.*       # match '@' followed by >= 0 chars other than line terminators
)           # end positive lookahead
|           # or
(?=         # begin a positive lookahead
  \r?\n     # match a line terminator
  [^@\r\n]* # match >= 0 characters other than '@' and line terminators 
)           # end positive lookahead

Notice that matches are zero-width.
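
As a rough illustration of what zero-width means here, the sketch below lists where the pattern matches and cuts the string at those positions by hand. It uses the sample from the question (no trailing newline) and assumes Python 3.7+; the names positions, cuts and pieces are only illustrative.

```python
import re

s = "Name: @s\nOwner: @a[tag=Admin]"
rgx = r'(?=@.*)|(?=\r?\n[^@\r\n]*)'

# Each match is empty: its start and end are the same index.
positions = [(m.start(), m.end()) for m in re.finditer(rgx, s)]
print(positions)  # [(6, 6), (8, 8), (16, 16)]

# Cutting the string at those indices gives the same pieces as re.split,
# and no characters are consumed by the delimiters.
cuts = [0] + [start for start, _ in positions] + [len(s)]
pieces = [s[a:b] for a, b in zip(cuts, cuts[1:])]
print(pieces)  # ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']
```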

Upvotes: 3

Tim Biegeleisen

Reputation: 521639

I would use re.findall here:

import re

sample = """Name: @s
Owner: @a[tag=Admin]"""
parts = re.findall(r'@\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts)  # ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']

The regex pattern used here says to match:

@\w+          a tag @some_tag
(?:\[.*?\])?  followed by an optional [...] term
|             OR
\s*\S+\s*     any other non whitespace term,
              including optional whitespace on both sides
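
A quick sanity check, as a sketch: on the sample the pieces cover the whole string, so joining them gives the input back. The second input is a made-up case, not from the question, showing one thing to watch for: if a tag is glued to the preceding text with no whitespace, the \S+ branch takes it along.

```python
import re

pattern = r'@\w+(?:\[.*?\])?|\s*\S+\s*'
sample = """Name: @s
Owner: @a[tag=Admin]"""

parts = re.findall(pattern, sample)
print(parts)  # ['Name: ', '@s', '\nOwner: ', '@a[tag=Admin]']

# Nothing is lost: the pieces concatenate back to the input.
print("".join(parts) == sample)  # True

# Hypothetical input with no space before the tag: the tag is not separated.
glued = re.findall(pattern, "Owner:@a[tag=Admin]")
print(glued)  # ['Owner:@a[tag=Admin]']
```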

Upvotes: 3
