sciroccorics

Reputation: 2427

How to escape specific whitespaces when splitting line into words with regex

I want to split a string into a list of words (here "word" means an arbitrary sequence of non-whitespace characters), but also keep the groups of consecutive whitespace characters that were used as separators (because the amount of whitespace is significant in my data). For this simple task, I know that the following regex does the job (I use Python as an illustrative language, but the code can easily be adapted to any language with regex support):

import re
regexA = re.compile(r"(\S+)")
print(regexA.split("aa b+b   cc dd!    :ee  "))

produces the expected output:

['', 'aa', ' ', 'b+b', '   ', 'cc', ' ', 'dd!', '    ', ':ee', '  ']
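
For comparison, here is a minimal sketch of an alternative tokenizer (not the approach used above): re.findall with the alternation \S+|\s+ returns the words and the whitespace runs in order, without the leading empty string that split produces when the input starts with a word. The name token_re is only illustrative.

```python
import re

# Alternative sketch: match either a run of non-whitespace or a run of whitespace.
token_re = re.compile(r"\S+|\s+")
source = "aa b+b   cc dd!    :ee  "
tokens = token_re.findall(source)
print(tokens)
```

Joining the tokens back together reproduces the input exactly, which is a handy sanity check when the amount of whitespace matters.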

Now the hard part: when a word includes an opening parenthesis, all the whitespaces encountered until the matching closing parenthesis should not be considered as word separators. In other words:

regexB.split("aa b+b   cc(dd! :ee (ff gg) hh) ii  ")

should produce:

['', 'aa', ' ', 'b+b', '   ', 'cc(dd! :ee (ff gg) hh)', ' ', 'ii', '  ']

Using

regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')

works for a single pair of parentheses, but fails when there are inner parentheses. How could I improve the regex to correctly skip inner parentheses?
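
The failure is easy to reproduce. In the sketch below, [^)]* stops at the first closing parenthesis, so the word seems to be cut right after the inner group and hh) ends up as a separate word.

```python
import re

# regexB as defined above: handles a single pair of parentheses only
regexB = re.compile(r'([^(\s]*\([^)]*\)|\S+)')
result = regexB.split("aa b+b   cc(dd! :ee (ff gg) hh) ii  ")
print(result)
# ['', 'aa', ' ', 'b+b', '   ', 'cc(dd! :ee (ff gg)', ' ', 'hh)', ' ', 'ii', '  ']
```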

And the final question: in my data, only words starting with % should be tested for the "parenthesis rule" (regexB), the other words should be treated by regexA. I have no idea how to combine two regexes in a single split.

Any hint is warmly welcome...

Upvotes: 0

Views: 80

Answers (2)

sciroccorics

Reputation: 2427

Finally, after testing several ideas based on the answers proposed by @Wiktor Stribiżew and @Thm Lee, I came up with a bunch of solutions dealing with different levels of complexity. To reduce dependencies, I wanted to stick to the re module from the Python standard library, so here is the code:

import re

text = "aa b%b(   %cc(dd! (:ee ff) gg) %hh ii)  "

# Solution 1: don't process parentheses at all
regexA = re.compile(r'(\S+)')
print("Solution 1:", regexA.split(text))

# Solution 2: works for non-nested parentheses
regexB = re.compile(r'(%[^(\s]*\([^)]*\)|\S+)')
print("Solution 2:", regexB.split(text))

# Solution 3: works for one level of nested parentheses
regexC = re.compile(r'(%[^(\s]*\((?:[^()]*\([^)]*\))*[^)]*\)|\S+)')
print("Solution 3:", regexC.split(text))

# Solution 4: works for arbitrary levels of nested parentheses
# n is the current parenthesis depth inside a word starting with %
n, words = 0, []
for word in regexA.split(text):
    if n:
        words[-1] += word   # still inside parentheses: glue onto the current word
    else:
        words.append(word)
    if n or (word and word[0] == '%'):
        n += word.count('(') - word.count(')')
print("Solution 4:", words)

Here is the generated output:

Solution 1: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 2: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 3: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 4: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']

As stated in the OP, for my specific data, whitespace inside parentheses only has to be escaped for words starting with %; other parentheses (e.g. the word b%b( in my example) are not considered special. If you want to escape whitespace inside any pair of parentheses, simply remove the % character from the regexes (and, for Solution 4, drop the word[0] == '%' test). Here is the result with that modification:

Solution 1: ['', 'aa', ' ', 'b%b(', '   ', '%cc(dd!', ' ', '(:ee', ' ', 'ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 2: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff)', ' ', 'gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 3: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff) gg)', ' ', '%hh', ' ', 'ii)', '  ']
Solution 4: ['', 'aa', ' ', 'b%b(   %cc(dd! (:ee ff) gg) %hh ii)', '  ']
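
Solution 4 can also be wrapped in a small function so that the % rule becomes a parameter. This is only a sketch: the name split_keep_spaces and the prefix argument are hypothetical, not part of the code above. Passing prefix="" protects every parenthesised group, and, like the loop above, the function assumes the parentheses are balanced.

```python
import re

def split_keep_spaces(text, prefix="%"):
    """Split text into words and whitespace runs, keeping whitespace
    inside parentheses attached to the enclosing word.

    prefix (hypothetical parameter): only words starting with this string
    open a protected group; "" protects every parenthesised group.
    """
    depth, words = 0, []
    for token in re.split(r"(\S+)", text):
        if depth:
            words[-1] += token      # inside parentheses: extend the current word
        else:
            words.append(token)
        # Track depth once inside a group, or when a word opens one.
        if depth or (token and token.startswith(prefix)):
            depth += token.count("(") - token.count(")")
    return words

text = "aa b%b(   %cc(dd! (:ee ff) gg) %hh ii)  "
print(split_keep_spaces(text))
print(split_keep_spaces(text, prefix=""))
```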

Upvotes: 0

Thm Lee

Reputation: 1236

The PCRE regex engine supports subroutine calls, so a recursive pattern seems workable for the case of balanced nested parentheses:

(?m)\s+(?=[^()]*(\([^()]*(?1)?[^()]*\))*[^()]*$)

Here (?1) calls subroutine 1, (\([^()]*(?1)?[^()]*\)), which is a recursive pattern because the group contains its own caller, (?1).

But Python's re module does not support subroutine calls in regexes.

So I tried first replacing every ( and ) with another distinctive character (@ in this example), then applying the regex to split, and finally turning each @ back into ( or ) in my Python script.

Regex for splitting:

(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)

Here I changed your separator \S+ to runs of whitespace \s+, because @, ( and ) all belong to the set of characters matched by \S.

The Python script might look like this:

import re

ss = "aa b+b   cc(dd! :ee ((ff gg)) hh) ii  "
ss = re.sub(r"[()]", "@", ss)   # replace every `(` and `)` with `@`

# split on whitespace runs followed by an even number of `@` up to the end of the line
regx = re.compile(r"(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)")
m = regx.split(ss)
for i, piece in enumerate(m):
    n = piece.count("@")
    if n < 2:
        continue
    # turn the first half of the `@` back into `(` and the rest into `)`
    m[i] = piece.replace("@", "(", n // 2).replace("@", ")")
print(m)

Output is

['aa', ' ', 'b+b', '   ', 'cc(dd! :ee ((ff gg)) hh)', ' ', 'ii', '  ', '']
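
One caveat worth checking: the lookahead only tests that an even number of @ remain before the end of the line, which is a parity test rather than true bracket matching. The sketch below wraps the steps above in a hypothetical helper, split_by_parity (the name is an assumption, not part of the script), and runs it on both the input used here and the single-nested input from the question. The second case seems to be split inside the parentheses.

```python
import re

PARITY_SPLIT = re.compile(r"(?m)(\s+)(?=[^@]*(?:(?:@[^@]*){2})*$)")

def split_by_parity(s):
    # Hypothetical helper wrapping the steps above: mask, split, restore.
    pieces = PARITY_SPLIT.split(re.sub(r"[()]", "@", s))
    out = []
    for piece in pieces:
        n = piece.count("@")
        if n >= 2:
            # first half of the `@` become `(`, the rest become `)`
            piece = piece.replace("@", "(", n // 2).replace("@", ")")
        out.append(piece)
    return out

doubled = split_by_parity("aa b+b   cc(dd! :ee ((ff gg)) hh) ii  ")
single = split_by_parity("aa b+b   cc(dd! :ee (ff gg) hh) ii  ")
print(doubled)  # the parenthesised word stays in one piece
print(single)   # split after '(ff', and the restored parentheses are misplaced
```

When the inner group contains whitespace and an odd number of closing parentheses follow it, a depth counter over the tokens is likely the safer choice.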

Upvotes: 1
