jakob.j

Reputation: 963

Regex for splitting at newlines while ignoring newlines inside text surrounded by arbitrary number of quotes

In Python, I need to split a string at newlines while ignoring newlines inside text parts which are surrounded by an arbitrary number of quotes (e.g. """This is text in triple-quotes""", with the same number of quotes at start and end).

This example string:

Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line

Should produce the list:

['Line outside quotes', 'Another line', '"Two lines inside\nnormal quotes"', 
 '""Two lines inside\nfancy "dual" quotes""', 
 '"""Three lines inside\n"even fancier"\ntriple quotes"""', 
 'Last line']

Inspired by this answer from Veedrac, I came up with the following regex to match groups:

(?:("+)[\s\S]+?\1|.)+

with the part ("+)[\s\S]+?\1 meaning "find a number of quotes (matching group), then a number of anything (not greedy), and finally the matching group again (same number of quotes)".

According to a test on RegExr.com, this regex works as I would expect: https://regexr.com/52qla

However, if I implement this in Python, I get an unexpected result. My test code:

import re

input = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

matcher = re.compile(r'(?:("+)[\s\S]+?\1|.)+')
result = matcher.findall(input)

print(str(result))

Produces the output:

['', '', '"', '""', '"""', '']

which is not what I expect.

It doesn't seem to make a difference if I use the integrated "re" module or the "regex" module.

I hope someone has an idea. Thanks!

Upvotes: 2

Views: 738

Answers (4)

Mats Kindahl

Reputation: 2075

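The reason you get the strange list is that findall returns the contents of the capturing groups, not the whole match, as soon as the pattern contains any group. Your pattern has exactly one group, the one matching the quotes, so that is what is returned (an empty string where the group did not take part in the match).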
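
As a small illustration of that findall behaviour, here is a sketch on a made-up string (the sample text and variable names are assumptions for the demo, not part of the question):

```python
import re

sample = 'aab ab'

# No capturing group: findall returns the full matches.
no_group = re.findall(r'a+b', sample)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r'(a+)b', sample)

# Several groups: findall returns one tuple per match.
two_groups = re.findall(r'((a+)b)', sample)

# A group that takes no part in the match is reported as ''.
unused_group = re.findall(r'(x)|b', 'b')

print(no_group, one_group, two_groups, unused_group)
```

The last case is where the empty strings in the question's output come from: on lines without quotes the quote group never matches.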

Instead, put a real group around the complete match, and extract the right tuple using list-comprehension:

import re

input = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

result = [x[0] for x in re.findall(r'((\"+)[\s\S]+?\2|.+)', input)]

print(str(result))

The call to findall here will return the list:

[('Line outside quotes', ''), ('Another line', ''), ('"Two lines inside\nnormal quotes"', '"'), ('""Two lines inside\nfancy "dual" quotes""', '""'), ('"""Three lines inside\n"even fancier"\ntriple quotes"""', '"""'), ('Last line', '')]

You can see that the first element of each tuple contains the string you want, while the second element is the (optional) match of the quotes. The list comprehension extracts the first element of each tuple in the list, generating the correct result:

['Line outside quotes', 'Another line', '"Two lines inside\nnormal quotes"', '""Two lines inside\nfancy "dual" quotes""', '"""Three lines inside\n"even fancier"\ntriple quotes"""', 'Last line']
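
If you would rather keep the original pattern from the question unchanged, one possible alternative is to iterate over match objects and take the whole match with group(0), which ignores what the capturing groups hold. This is only a sketch; the names text, pattern and pieces are chosen for the example:

```python
import re

text = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

# The pattern from the question, with its single capturing group.
pattern = re.compile(r'(?:("+)[\s\S]+?\1|.)+')

# group(0) is always the complete match, whatever the groups captured.
pieces = [m.group(0) for m in pattern.finditer(text)]

print(pieces)
```

With finditer the backreference can stay as it is, and no extra outer group or tuple indexing is needed.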

Update: The code above does not handle lines that contain tokens outside the quotes. To handle that, we have to recognize that each line consists of one or more of the following tokens:

  • A token matching a quoted string, potentially with newlines inside the quotes.
  • A token consisting of non-newline non-quote characters.

This can be matched by using a non-capturing group to provide the two alternative tokens, and a capturing group around it to match the whole sequence of tokens:

import re

input = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Line that "quotes" something
"quotes" something
Line that "quotes"
Line that "quotes
with newline" something
'''

matches = [x[0] for x in re.findall(r'((?:(\"+)[\s\S]+?\2|[^"\n]+)+)', input)]

for match in matches:
    print("---")
    print(str(match))

Note that we need to replace .+ with a pattern that cannot match quotes (or newlines). Otherwise the greedy .+ starts at a non-quote character, gobbles up the opening quote as well and stops at the newline, which splits the line in the middle of the quoted text (it is hard to explain better, so try putting .+ back and see what happens).
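
To make the difference concrete, here is a sketch that runs both patterns on a single problem line taken from the sample above (the variable names are made up for the example):

```python
import re

text = 'Line that "quotes\nwith newline" something'

# First version: .+ happily runs over the opening quote.
greedy = re.compile(r'((\"+)[\s\S]+?\2|.+)')

# Updated version: plain tokens may contain neither quotes nor newlines.
tokens = re.compile(r'((?:(\"+)[\s\S]+?\2|[^"\n]+)+)')

greedy_result = [m.group(1) for m in greedy.finditer(text)]
token_result = [m.group(1) for m in tokens.finditer(text)]

print(greedy_result)
print(token_result)
```

The first pattern splits the text at the newline inside the quotes, while the token-based pattern keeps it as one piece.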

Upvotes: 1

E. Körner

Reputation: 122

This seems to work:

# [...]
matcher = re.compile(r'(?:("+)([\s\S]+?))(\1)|(.+)')
# [...]

produces:

[('', '', '', 'Line outside quotes'), ('', '', '', 'Another line'), ('"', 'Two lines inside\nnormal quotes', '"', ''), ('""', 'Two lines inside\nfancy "dual" quotes', '""', ''), ('"""', 'Three lines inside\n"even fancier"\ntriple quotes', '"""', ''), ('', '', '', 'Last line')]

I wrapped the quotes and the string in their own groups. The 'else' clause, if you can call it that, is |(.+).

So, if the first field is empty, it is an unquoted string, and contained in the last field. Else, the first three fields contain the quotes (front + back) and the inner string. A simple "".join(single_result_tuple) per result should suffice:

# [...]
result = ["".join(r) for r in result]
# [...]
['Line outside quotes', 'Another line', '"Two lines inside\nnormal quotes"', '""Two lines inside\nfancy "dual" quotes""', '"""Three lines inside\n"even fancier"\ntriple quotes"""', 'Last line']

(With named groups you may be better able to exactly extract your correct content.)
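
A possible version with named groups could look like the sketch below. The group names quotes, body and plain are invented for this example, and each record stores the number of quotes together with the inner text:

```python
import re

text = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

# (?P=quotes) is a backreference to the named group "quotes".
pattern = re.compile(
    r'(?P<quotes>"+)(?P<body>[\s\S]+?)(?P=quotes)|(?P<plain>.+)'
)

records = []
for m in pattern.finditer(text):
    if m.group('quotes') is not None:
        # Quoted piece: remember how many quotes surrounded it.
        records.append((len(m.group('quotes')), m.group('body')))
    else:
        # Unquoted line: no surrounding quotes.
        records.append((0, m.group('plain')))

print(records)
```

This way the quotation style and the content can be read by name instead of by position in the tuple.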


And by rearranging the groups (wrapping everything in one outer group):

matcher = re.compile(r'(("+)[\s\S]+?\2|.+)')

you can get:

[('Line outside quotes', ''), ('Another line', ''), ('"Two lines inside\nnormal quotes"', '"'), ('""Two lines inside\nfancy "dual" quotes""', '""'), ('"""Three lines inside\n"even fancier"\ntriple quotes"""', '"""'), ('Last line', '')]

So you are able to check which quotation style was used. The content is in the first field:

# [...]
result = [r[0] for r in result]
# [...]

To get only the string, you must do some post-processing. The backreference \2 needs a capturing group (...), so you can't exclude that group from the result by turning it into a non-capturing group (?:...).

Upvotes: 1

Minu

Reputation: 438

import re

input = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

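matcher = re.compile(r'(?:("+)([\s\S]+?)(\1)|(.+))', re.MULTILINE)
result = matcher.findall(input)
print(["".join(x) for x in result])

I got the result you want with the code above. For the content to be returned, "[\s\S]+?" has to be wrapped in its own capturing group, and so does the closing \1, otherwise the closing quotes are missing from the joined string. Note that re.MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern and can be dropped.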

Upvotes: 3

ssm

Reputation: 5383

I tried the following:

import re

input = '''Line outside quotes
Another line
"Two lines inside
normal quotes"
""Two lines inside
fancy "dual" quotes""
"""Three lines inside
"even fancier"
triple quotes"""
Last line'''

matchers = [
    '(\n""")([A-Za-z].*?[A-Za-z])("""\n)', # 3 quotes
    '(\n"")([A-Za-z].*?[A-Za-z])(""\n)',   # 2 quotes
    '(\n")([A-Za-z].*?[A-Za-z])("\n)',     # single quote
]

allResults = []

for m in matchers:
    matcher = re.compile(m, re.MULTILINE|re.DOTALL)
    result = matcher.findall(input)

    allResults += [r[1] for r in result]
    input = matcher.subn("\n", input)[0]


allResults += input.split('\n')
print(allResults)

Basically, I don't know if it is possible to separate the single quote from the multi quotes. So, the idea is to go in stages, and extract the triple quotes, double quotes etc., one at a time.

This method looks very hacky. Maybe someone else will get inspired to do something interesting.

Upvotes: 1
