Why am I getting blank strings here?

Question

I want to split a string on individual newlines or groups of spaces. I get the result I want, except for the empty strings (''). How do I eliminate those?

Edit: I need the output to retain the whitespace groups and to split on each newline. The only unwanted items are the empty strings ('').

In [208]: re.split('(\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
Out[208]: 
['many',
 '   ',
 'fancy',
 ' ',
 'word',
 ' ',
 '',
 '\n',
 '',
 '\n',
 '',
 '    ',
 'hello',
 '    ',
 '\t',
 '   ',
 'hi']

falsetru · Accepted Answer

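If the pattern includes a capturing group, the separators are included in the result list. The empty strings appear wherever two separators are adjacent (for example, a space followed by a newline): re.split returns the empty text between them.

If you don't use a capturing group, or you replace the capturing group ((...)) with a non-capturing group ((?:...)), the separators are not included.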

# Not using group at all
>>> re.split('\n|\ +', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']


# Using non-capturing group
>>> re.split('(?:\n|\ +)', 'many   fancy word \n\n    hello    \t   hi')
['many', 'fancy', 'word', '', '', '', 'hello', '\t', 'hi']

Quoting the re.split documentation:

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
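
To illustrate the maxsplit part of that quote, here is a minimal sketch using the question's sample string. The variable names (text, limited) and the raw-string pattern with a plain space instead of '\ ' are illustrative choices, not taken from the original post; the pattern should match the same separators.

```python
import re

text = 'many   fancy word \n\n    hello    \t   hi'

# Capturing group keeps the separators; maxsplit=2 stops after two splits,
# so everything after the second separator comes back as one final element.
limited = re.split(r'(\n| +)', text, maxsplit=2)

print(limited)
# ['many', '   ', 'fancy', ' ', 'word \n\n    hello    \t   hi']
```

With maxsplit left at its default of 0, the whole string is split, which is what produces the full list shown in the question.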

UPDATE, following the question edit:

You can filter out the empty strings using filter(None, ...):

list(filter(None, re.split('(\n|\ +)', 'many fancy word \n\n hello \t hi')))

or using re.findall with modified pattern:

re.findall('\n|\ +|[^\n ]+', 'many fancy word \n\n hello \t hi')
# `[^\n ]` matches any character that is neither a newline nor a space.
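
As a quick check, the sketch below applies both approaches to the question's original string and confirms they give the same list. The variable names (text, via_split, via_findall) are illustrative, and the patterns are written as raw strings with a plain space rather than '\ ', which is assumed to be equivalent here.

```python
import re

text = 'many   fancy word \n\n    hello    \t   hi'

# Approach 1: split with a capturing group, then drop the empty strings.
via_split = list(filter(None, re.split(r'(\n| +)', text)))

# Approach 2: match a single newline, a run of spaces,
# or a run of characters that are neither newline nor space.
via_findall = re.findall(r'\n| +|[^\n ]+', text)

print(via_split == via_findall)  # True
print(via_findall)
# ['many', '   ', 'fancy', ' ', 'word', ' ', '\n', '\n',
#  '    ', 'hello', '    ', '\t', '   ', 'hi']
```

The findall version never produces empty strings in the first place, because every alternative in the pattern must consume at least one character.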
