Reputation: 23078
I want to split a string by individual newlines or groups of spaces. I get the result I want, except for the '' strings. How do I eliminate those?
Edit: I need the output to retain whitespace groups and split on each newline. The only unwanted things are the '' strings.
In [208]: re.split('(\n|\ +)', 'many fancy word \n\n hello \t hi')
Out[208]:
['many',
' ',
'fancy',
' ',
'word',
' ',
'',
'\n',
'',
'\n',
'',
' ',
'hello',
' ',
'\t',
' ',
'hi']
Upvotes: 1
Views: 65
Reputation: 368954
If the pattern includes a capturing group, the matched separators are included in the result list.
If you don't use a capturing group, or you replace the capturing group, (...), with a non-capturing group, (?:...), the separators are not included.
# Not using group at all
>>> re.split('\n|\ +', 'many fancy word \n\n hello \t hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']
# Using non-capturing group
>>> re.split('(?:\n|\ +)', 'many fancy word \n\n hello \t hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']
Quoting the re.split documentation:
Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
UPDATE (following the question edit):
You can filter out the empty strings using filter(None, ...):
list(filter(None, re.split('(\n|\ +)', 'many fancy word \n\n hello \t hi')))
or use re.findall with a modified pattern:
re.findall('\n|\ +|[^\n ]+', 'many fancy word \n\n hello \t hi')
# `[^\n ]` matches any character that is neither a newline nor a space.
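As a quick check, here is a minimal sketch comparing the two approaches on the string from the question. The names `text`, `split_tokens` and `found_tokens` are only illustrative. It uses raw strings and a plain space instead of `\ `, on the assumption that you want to avoid the invalid-escape warning newer Python versions emit for '\ ' in a normal string literal.

```python
import re

text = 'many fancy word \n\n hello \t hi'

# Approach 1: split with a capturing group (so separators are kept),
# then drop the empty strings. The comprehension is equivalent to
# list(filter(None, ...)) for a list of strings.
split_tokens = [tok for tok in re.split(r'(\n| +)', text) if tok]

# Approach 2: match tokens directly. Each match is a single newline,
# a run of spaces, or a run of characters that are neither.
found_tokens = re.findall(r'\n| +|[^\n ]+', text)

# Both approaches should yield the same token list for this input.
print(split_tokens == found_tokens)
print(found_tokens)
```

The findall version never produces empty strings in the first place, so there is nothing to filter. The split version may be easier to adapt if the separator pattern changes later.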
Upvotes: 2