Yuval Atzmon

Reputation: 5945

Parsing a string in Python: how to split on newlines while ignoring newlines inside quotes

I have a text that I need to parse in Python.

It is a string that I would like to split into a list of lines. However, if a newline (\n) is inside quotes, it should be ignored.

For example:

abcd efgh ijk\n1234 567"qqqq\n---" 890\n

should be parsed into a list of the following lines:

abcd efgh ijk
1234 567"qqqq\n---" 890

I've tried to do it with split('\n'), but I don't know how to ignore the quotes.

Any ideas?

Thanks!

Upvotes: 5

Views: 2794

Answers (4)

njzk2

Reputation: 39403

You can split on newlines, then use reduce to merge each element into the previous one whenever the previous one contains an odd number of " characters:

from functools import reduce  # needed on Python 3; reduce is a builtin only on Python 2
txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
s = txt.split('\n')
reduce(lambda x, y: x[:-1] + [x[-1] + '\n' + y] if x[-1].count('"') % 2 == 1 else x + [y], s[1:], [s[0]])
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

Explanation:

if x[-1].count('"') % 2 == 1
# If the last handled element contains an odd number of quotes
x[:-1] + [x[-1] + '\n' + y]
# Join y onto that element, restoring the newline that split() removed
else x + [y]
# Otherwise append y to the handled list as a new element

Can also be written like so:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    for item in s:
        if res and res[-1].count('"') % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

As pointed out by @Veedrac, this is O(n^2), because the quotes in res[-1] are recounted on every iteration. This can be avoided by keeping a running count of " characters:

def splitWithQuotes(txt):
    s = txt.split('\n')
    res = []
    cnt = 0
    for item in s:
        if res and cnt % 2 == 1:
            res[-1] = res[-1] + '\n' + item
        else:
            res.append(item)
            cnt = 0
        cnt += item.count('"')
    return res
splitWithQuotes(txt)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']

(The last empty string is because of the last \n at the end of the input string.)
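
The same quote-parity idea can also be applied one character at a time, which avoids both the recounting and the repeated string concatenation. The following is only a minimal sketch under the assumption that quotes never nest and are never escaped; the function name split_outside_quotes and its parameters are hypothetical.

```python
def split_outside_quotes(text, quote='"', sep='\n'):
    """Split text on sep, except where sep falls between a pair of quote chars."""
    lines = []
    current = []          # characters of the line being built
    in_quotes = False     # True while an opening quote has not been closed
    for ch in text:
        if ch == quote:
            in_quotes = not in_quotes
        if ch == sep and not in_quotes:
            # A separator outside quotes ends the current line
            lines.append(''.join(current))
            current = []
        else:
            current.append(ch)
    # Whatever is left after the last separator is the final line
    lines.append(''.join(current))
    return lines


txt = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
print(split_outside_quotes(txt))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```

Like str.split('\n'), this keeps the trailing empty string when the input ends with a newline.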

Upvotes: 4

Veedrac

Reputation: 60227

Here's a much easier solution.

Match groups of (?:"[^"]*"|.)+. Namely, "things in quotes or things that aren't newlines".

Example:

import re
text = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
re.findall('(?:"[^"]*"|.)+', text)
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890']

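NOTE: This coalesces several newlines into one, as blank lines are ignored. To avoid that, the pattern needs a null case as well. Simply adding an empty alternative, as in (?:"[^"]*"|.)+|(?!\Z), is not quite enough: the newline itself is never consumed, so findall also reports an empty match at every line break. Matching the line terminator explicitly avoids this: (?!\Z)((?:"[^"]*"|.)*)(?:\n|\Z), where the capture group holds the line.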

The (?!\Z) is a confusing way to say "not the end of a string". The (?! ) is negative lookahead; the \Z is the "end of a string" part.
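
As an illustration of keeping blank lines, here is a minimal sketch that consumes each line's terminating newline and returns the line through a capture group. The pattern and the helper name split_keep_blank_lines are assumptions made for this example, and balanced quotes are assumed.

```python
import re

# Assumed pattern for illustration: a (possibly empty) run of quoted chunks or
# non-newline characters, followed by the newline that ends the line or by the
# end of the string. The leading (?!\Z) prevents a final empty match at the end.
LINE_RE = re.compile(r'(?!\Z)((?:"[^"]*"|.)*)(?:\n|\Z)')


def split_keep_blank_lines(text):
    """Return the lines of text, keeping blank lines and quoted newlines."""
    # With a single capture group, findall returns the group's text per match
    return LINE_RE.findall(text)


print(split_keep_blank_lines('a\n\nb'))
# ['a', '', 'b']
print(split_keep_blank_lines('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890']
```

A single trailing newline does not produce an extra empty string here, which mirrors how str.splitlines() treats the end of the input.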


Tests:

import re

texts = (
    'text',
    '"text"',
    'text\ntext',
    '"text\ntext"',
    'text"text\ntext"text',
    'text"text\n"\ntext"text"',
    '"\n"\ntext"text"',
    '"\n"\n"\n"\n\n\n""\n"\n"'
)

line_matcher = re.compile('(?:"[^"]*"|.)+')

for text in texts:
    print("{:>27} → {}".format(
        text.replace("\n", "\\n"),
        " [LINE] ".join(line_matcher.findall(text)).replace("\n", "\\n")
    ))

#>>>                        text → text
#>>>                      "text" → "text"
#>>>                  text\ntext → text [LINE] text
#>>>                "text\ntext" → "text\ntext"
#>>>        text"text\ntext"text → text"text\ntext"text
#>>>    text"text\n"\ntext"text" → text"text\n" [LINE] text"text"
#>>>            "\n"\ntext"text" → "\n" [LINE] text"text"
#>>>    "\n"\n"\n"\n\n\n""\n"\n" → "\n" [LINE] "\n" [LINE] "" [LINE] "\n"

Upvotes: 8

igortg

Reputation: 160

There are many ways to accomplish that. I came up with a very simple one:

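import re

text = 'abcd efgh ijk\n1234 567"qqqq\n---" 890\n'
splitted = [""]
# Splitting on '"' puts text outside quotes at even indices, quoted text at odd ones
for i, x in enumerate(re.split('"', text)):
    if i % 2 == 0:
        # Outside quotes: newlines here are real line breaks
        lines = x.split('\n')
        splitted[-1] += lines[0]
        splitted.extend(lines[1:])
    else:
        # Inside quotes: keep the chunk whole and restore the surrounding quotes
        splitted[-1] += '"{0}"'.format(x)
# splitted == ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']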

Upvotes: 1

georg

Reputation: 215049

Ok, this seems to work (assuming quotes are properly balanced):

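rx = r"""(?x)
    \n                      # a newline...
    (?!                     # ...that is NOT followed by
        [^"]*               #     some non-quote characters
        "                   #     and then a quote,
        (?=                 #     where that quote is followed by
            [^"]*           #         non-quote characters,
            (?:             #         then any number of
                " [^"]* "   #             complete quoted strings,
                [^"]*       #             each with non-quote text after it,
            )*
            $               #         up to the end of the input
        )
    )
"""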

Test:

str = """\
first
second "qqq
     qqq
     qqq
     " line
"third
    line" AND "spam
        ham" AND "more
            quotes"
end \
"""

import re


for x in re.split(rx, str):
    print('[%s]' % x)

Result:

[first]
[second "qqq
     qqq
     qqq
     " line]
["third
    line" AND "spam
        ham" AND "more
            quotes"]
[end ]

If the above looks too weird for you, you can also do this in two steps:

str = re.sub(r'"[^"]*"', lambda m: m.group(0).replace('\n', '\x01'), str)
lines = [x.replace('\x01', '\n') for x in str.splitlines()]

for line in lines:
    print('[%s]' % line)  # same result
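
For reuse, the two-step idea can be wrapped in a function. This is only a sketch under two assumptions: the placeholder character '\x01' never occurs in the input, and quotes are balanced. The names PLACEHOLDER and split_lines_two_step are hypothetical. It splits with split('\n') instead of splitlines(), because on Python 3 str.splitlines() also breaks on other line boundaries such as \r and \x0c.

```python
import re

PLACEHOLDER = '\x01'  # assumed never to occur in the input text


def split_lines_two_step(text):
    """Split text on newlines, leaving newlines inside "..." untouched."""
    # Step 1: hide newlines that sit inside a quoted span behind the placeholder
    masked = re.sub(r'"[^"]*"',
                    lambda m: m.group(0).replace('\n', PLACEHOLDER),
                    text)
    # Step 2: split on the remaining newlines, then restore the hidden ones
    return [line.replace(PLACEHOLDER, '\n') for line in masked.split('\n')]


print(split_lines_two_step('abcd efgh ijk\n1234 567"qqqq\n---" 890\n'))
# ['abcd efgh ijk', '1234 567"qqqq\n---" 890', '']
```

Unlike splitlines(), split('\n') keeps a trailing empty string when the input ends with a newline; drop it afterwards if that is not wanted.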

Upvotes: 1
