Reputation: 21451
My goal is to make a regex that can handle 2 situations:
The unorderedness combined with the different cases for newline and no newline is what makes this complex.
What is the most efficient way to do this?
E.g.
' \n \n \n a' # --> '\na'
' \t \t a' # --> ' a'
' \na\n ' # --> '\na\n'
Benchmark:
s = ' \n \n \n a \t \t a \na\n '
n_times = 1000000
------------------------------------------------------
change_whitespace(s) - 5.87 s
change_whitespace_2(s) - 3.51 s
change_whitespace_3(s) - 3.93 s
n_times = 100000
------------------------------------------------------
change_whitespace(s * 100) - 27.9 s
change_whitespace_2(s * 100) - 16.8 s
change_whitespace_3(s * 100) - 19.7 s
Upvotes: 2
Views: 1077
Reputation: 49330
This replaces the whitespace that contains a newline with a single newline, then replaces the whitespace that doesn't contain a newline with a single space.
import re
def change_whitespace(string):
return re.sub('[ \t\f\v]+', ' ', re.sub('[\s]*[\n\r]+[\s]*', '\n', string))
Results:
>>> change_whitespace(' \n \n \n a')
'\na'
>>> change_whitespace(' \t \t a')
' a'
>>> change_whitespace(' \na\n ')
'\na\n'
Thanks to @sln for reminding me of regex callback functions:
def change_whitespace_2(string):
return re.sub('\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', string)
Results:
>>> change_whitespace_2(' \n \n \n a')
'\na'
>>> change_whitespace_2(' \t \t a')
' a'
>>> change_whitespace_2(' \na\n ')
'\na\n'
And here's a function with @sln's expression:
def change_whitespace_3(string):
return re.sub('(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)', lambda x: ' ' if x.group(1) else '\n', string)
Results:
>>> change_whitespace_3(' \n \n \n a')
'\na'
>>> change_whitespace_3(' \t \t a')
' a'
>>> change_whitespace_3(' \na\n ')
'\na\n'
Upvotes: 1
Reputation:
(Assumes Python can do regex replace with callback function)
You could use some callback to see what the replacement needs to be.
Group 1 matches, replace with space.
Group 2 matches, replace with newline
(?<!\s)(?:([^\S\r\n]+)|(\s+))(?!\s)
(?<! \s ) # No whitespace behind
(?:
( [^\S\r\n]+ ) # (1), Non-linebreak whitespace
|
( \s+ ) # (2), At least 1 linebreak
)
(?! \s ) # No whitespace ahead
Upvotes: 2