Jesvin Jose

Reputation: 23078

Why am I getting blank strings here?

I want to split a string on individual newlines or groups of spaces. I get the result I want, except for the '' strings. How do I eliminate those?

Edit: I need the output to retain whitespace groups and split on each newline. The only unwanted things are the ''.

In [208]: re.split('(\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
Out[208]: 
['many',
 '   ',
 'fancy',
 ' ',
 'word',
 ' ',
 '',
 '\n',
 '',
 '\n',
 '',
 '    ',
 'hello',
 '    ',
 '\t',
 '   ',
 'hi']

Upvotes: 1

Views: 65

Answers (1)

falsetru

Reputation: 368954

If the pattern includes a capturing group, the separators are included in the result list. The empty strings are what lies between two adjacent separators (for example, between ' ' and '\n').

If you omit the capturing group, or replace the capturing group ((...)) with a non-capturing group ((?:...)), the separators are not included.

# Not using a group at all
>>> re.split('\n|\ +', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']


# Using a non-capturing group
>>> re.split('(?:\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']

Quoting the re.split documentation:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
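The behaviour described in that passage can be checked with a minimal sketch. The short input string here is made up for illustration, and the pattern is written as a raw string with a plain space (' +'), which matches the same text as '\ +' but avoids the invalid-escape warning on recent Python versions:

```python
import re

# Hypothetical short input, chosen only to keep the output readable.
text = 'a b\n\nc'

# Capturing group: the separators are kept in the result.
with_group = re.split(r'(\n| +)', text)

# Non-capturing group: the separators are dropped.
without_group = re.split(r'(?:\n| +)', text)

# maxsplit=1: only the first split happens; the remainder is returned
# unchanged as the final element.
limited = re.split(r'(\n| +)', text, maxsplit=1)

print(with_group)     # ['a', ' ', 'b', '\n', '', '\n', 'c']
print(without_group)  # ['a', 'b', '', 'c']
print(limited)        # ['a', ' ', 'b\n\nc']
```

The '' in the first two results is the empty text between the two consecutive newlines.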


UPDATE (following the question edit):

You can filter out the empty strings using filter(None, ...):

list(filter(None, re.split('(\n|\ +)', 'many fancy word \n\n hello \t hi')))

or use re.findall with a modified pattern:

re.findall('\n|\ +|[^\n ]+', 'many fancy word \n\n hello \t hi')
# `[^\n ]` matches any character that is neither a newline nor a space.
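As a rough check that the two approaches agree, here is a small sketch run on the string from the question. The variable names are only illustrative, and the patterns are written as raw strings with a plain space instead of '\ ':

```python
import re

text = 'many   fancy word \n\n    hello    \t   hi'

# Option 1: split with a capturing group, then drop the empty strings.
tokens_split = [t for t in re.split(r'(\n| +)', text) if t]

# Option 2: match every token directly: a single newline, a run of
# spaces, or a run of characters that are neither.
tokens_findall = re.findall(r'\n| +|[^\n ]+', text)

print(tokens_split == tokens_findall)  # True
print(''.join(tokens_findall) == text)  # True: nothing was lost
```

Both variants keep the whitespace groups and each newline as separate items. The findall version avoids creating the empty strings in the first place, while the filtered split stays closer to the original code.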

Upvotes: 2
