PascalVKooten

Reputation: 21451

Regex should handle whitespace including newline differently

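My goal is to make a regex that can handle 2 situations:

1. A run of whitespace that contains at least one newline should collapse to a single newline.
2. A run of whitespace that contains no newline should collapse to a single space.

The whitespace characters can appear in any order, and that, combined with the different handling of the newline and no-newline cases, is what makes this complex.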

What is the most efficient way to do this?

E.g.

'   \n \n \n a'     # --> '\na'
'   \t \t    a'     # --> ' a'  
'   \na\n     '     # --> '\na\n'

Benchmark:

s = '   \n \n \n a   \t \t    a   \na\n     '
n_times = 1000000
------------------------------------------------------
change_whitespace(s)   - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s

n_times = 100000
------------------------------------------------------
change_whitespace(s * 100)    - 27.9 s 
change_whitespace_2(s * 100)  - 16.8 s 
change_whitespace_3(s * 100)  - 19.7 s    
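For anyone who wants to reproduce timings like these, here is a minimal sketch using the standard `timeit` module. The three functions are taken from the answers below (with raw-string patterns); the helper name `bench` and the reduced repetition count are my own assumptions, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

def change_whitespace(string):
    # Two passes: line-break runs first, then the remaining whitespace runs.
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'\s*[\n\r]+\s*', '\n', string))

def change_whitespace_2(string):
    # One pass with a callback that inspects the matched run.
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)

def change_whitespace_3(string):
    # One pass; the capture group that matched decides the replacement.
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)',
                  lambda x: ' ' if x.group(1) else '\n', string)

def bench(func, text, number=10_000):
    # Hypothetical helper: total seconds for `number` calls of func(text).
    return timeit.timeit(lambda: func(text), number=number)

s = '   \n \n \n a   \t \t    a   \na\n     '
for func in (change_whitespace, change_whitespace_2, change_whitespace_3):
    print(f'{func.__name__}(s) - {bench(func, s):.3f} s')
```

Raising `number` toward the values used above gives more stable figures at the cost of a longer run.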

Upvotes: 2

Views: 1077

Answers (2)

TigerhawkT3

Reputation: 49330

This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.

import re

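def change_whitespace(string):
    # Inner pass: a whitespace run containing a line break becomes one '\n'.
    # Outer pass: each remaining run of non-linebreak whitespace becomes one space.
    return re.sub(r'[ \t\f\v]+', ' ', re.sub(r'\s*[\n\r]+\s*', '\n', string))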

Results:

>>> change_whitespace('   \n \n \n a')
'\na'
>>> change_whitespace('   \t \t    a')
' a'
>>> change_whitespace('   \na\n     ')
'\na\n'

Thanks to @sln for reminding me of regex callback functions:

def change_whitespace_2(string):
    # One pass: the callback returns '\n' if the matched run contains a newline.
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)

Results:

>>> change_whitespace_2('   \n \n \n a')
'\na'
>>> change_whitespace_2('   \t \t    a')
' a'
>>> change_whitespace_2('   \na\n     ')
'\na\n'
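One behavioural difference to be aware of: the callback checks only for '\n', so a run whose only line break is '\r' collapses to a space, whereas change_whitespace treats '\r' as a line break too. If that matters for your input, a variant along these lines should bring the two into agreement (the name change_whitespace_2_cr is hypothetical and only for illustration):

```python
import re

def change_whitespace_2_cr(string):
    # Hypothetical variant: treat '\r' as a line break as well as '\n'.
    def repl(match):
        run = match.group(0)
        return '\n' if ('\n' in run or '\r' in run) else ' '
    return re.sub(r'\s+', repl, string)
```

For input that only ever uses '\n' line endings, this behaves the same as change_whitespace_2.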

And here's a function with @sln's expression:

def change_whitespace_3(string):
    # Group 1 matches only when the whole run has no line break; otherwise group 2 does.
    return re.sub(r'(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)

Results:

>>> change_whitespace_3('   \n \n \n a')
'\na'
>>> change_whitespace_3('   \t \t    a')
' a'
>>> change_whitespace_3('   \na\n     ')
'\na\n'

Upvotes: 1

user557597

Reputation:

(Assumes Python can do regex replace with callback function)

You could use a callback to decide what the replacement needs to be:
If group 1 matches, replace with a space.
If group 2 matches, replace with a newline.

(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)

 (?<! \s )           # No whitespace behind
 (?:
      ( [^\S\r\n]+ )      # (1), Non-linebreak whitespace
   |  
      ( \s+ )             # (2), Any other whitespace run, i.e. at least 1 linebreak
 )
 (?! \s )            # No whitespace ahead
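Python's re.sub does accept a function as the replacement, so the commented expression above can be used more or less as written with the re.VERBOSE flag. A minimal sketch, where the names WHITESPACE_RUN and collapse_whitespace are my own and not part of the original expression:

```python
import re

# Verbose form of the expression; whitespace and comments in the pattern are ignored.
WHITESPACE_RUN = re.compile(r"""
    (?<!\s)              # no whitespace behind
    (?:
        ([^\S\r\n]+)     # (1) run of non-linebreak whitespace
      |
        (\s+)            # (2) any other run, i.e. one containing a linebreak
    )
    (?!\s)               # no whitespace ahead
""", re.VERBOSE)

def collapse_whitespace(text):
    # Hypothetical wrapper: group 1 -> single space, group 2 -> single newline.
    return WHITESPACE_RUN.sub(lambda m: ' ' if m.group(1) else '\n', text)
```

The two lookarounds force each match to cover a whole whitespace run, which is why checking which group matched is enough to pick the replacement.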

Upvotes: 2
