py_works

Reputation: 190

Python regex to split by comma or space (but leave quoted strings as they are)

I need to split a string by spaces or commas, but single- or double-quoted strings should be left as they are. It makes no difference whether the items are separated by a single space or by many. For example:

    """ 1,' unchanged 1' " unchanged  2 "   2.009,-2e15 """

should return

    """ 1,' unchanged 1'," unchanged  2 ",2.009,-2e15 """

There may be zero or more spaces before and after a comma; those spaces are to be ignored. In this particular context, as shown in the example string, if two single- or double-quoted strings are next to each other, they will be separated by a space or a comma.

I have a previous question, "python reg ex to include missing commas"; however, for that solution to work, a splitting comma must be followed by a space.

Upvotes: 1

Views: 1004

Answers (1)

NZP

Reputation: 185

Edit: previous versions clobbered the newline that would, I assume, be in the file. Fixed now.

This is probably too much on the "if in doubt, use brute force" side, but it works:

    regex = r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*"""

It leaves commas inside strings, and handles numbers with a leading dot.
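As a quick check of those two claims, here is a minimal sketch that applies the pattern with `re.sub`. The input strings are made up for illustration and are not from the question. The second check shows a limit I found while tracing the pattern by hand: a comma that directly follows a digit inside a quoted string is still picked up by the last alternative.

```python
import re

# Pattern copied verbatim from above.
regex = r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*"""

# Hypothetical input: a quoted string containing a comma, then a plain
# number, then a number with a leading dot.
sample = "'a , b' 1.5 .25"
cleaned = re.sub(regex, ",", sample)
print(cleaned)

# Hypothetical edge case: inside the quotes the comma comes right after a
# digit, so the last alternative matches it and the space after it is lost.
edge = "'a1, b'"
edge_cleaned = re.sub(regex, ",", edge)
print(edge_cleaned)
```

So the "leaves commas inside strings" part holds as long as the comma inside the string is not immediately preceded by a digit or a quote character.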

To get the output you want:

    re.sub(regex, ",", original_string)
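For reference, a self-contained sketch of that call on the sample string from the question. The pattern is the same one, only written as a list of its five alternatives so each can carry a comment. The names `ALTERNATIVES`, `PATTERN`, `normalized` and `fields` are mine, and the `re.split` step at the end is an assumption about what you might do next, not something the question asks for.

```python
import re

# The five alternatives of the pattern above; [^\S\n] means
# "whitespace other than a newline".
ALTERNATIVES = [
    r"""(?<=["'])[^\S\n]+(?=["'])""",        # whitespace between two quote characters
    r"""(?<=["'])[^\S\n]+(?=\d)""",          # whitespace between a quote and a digit
    r"""(?<=\d)[^\S\n]+(?=\d|\.\d)""",       # whitespace between a digit and a digit or ".digit"
    r"""(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])""",  # whitespace between a digit (itself preceded by a word character) and a quote
    r"""(?<=["'\d])[^\S\n]*,[^\S\n]*""",     # a comma with optional whitespace around it, starting right after a quote or digit
]
PATTERN = re.compile("|".join(ALTERNATIVES))

# Sample from the question, including its leading and trailing space.
original_string = """ 1,' unchanged 1' " unchanged  2 "   2.009,-2e15 """

# Every separator becomes a single comma; quoted strings are untouched.
normalized = PATTERN.sub(",", original_string)
print(normalized)

# Assumed follow-up: split directly on the same pattern to get a list.
fields = PATTERN.split(original_string.strip())
print(fields)
```

Splitting works here because the pattern has no capturing groups (lookarounds do not capture), so `re.split` returns only the pieces between separators.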

For a rough idea of performance [1], on an Ivy Bridge Celeron:

    import timeit

    s = """\
    import re

    s = \"\"\"1,' unchanged 1' " unchanged  2 "   2.009,-2e15 35  "  fad!" '   dfgsdfg ' ,   'asdfasdf'  " fasf ,  , asfa" "2 fs", .085     .835\"\"\"
    rgex = re.compile(r\"\"\"(?<=["'])\s+(?=["'])|(?<=["'])\s+(?=\d)|(?<=\d)\s+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)\s+(?=["'])|(?<=["'\d])\s*,\s*\"\"\")

    re.sub(rgex, ",", s)

    """

    print("1k iterations: ", timeit.timeit(stmt=s, number=1000))
    print("10k iterations: ", timeit.timeit(stmt=s, number=10000))
    print("100k iterations: ", timeit.timeit(stmt=s, number=100000))
    print("200k iterations: ", timeit.timeit(stmt=s, number=200000))
    print("300k iterations: ", timeit.timeit(stmt=s, number=300000))

gives:

    1k iterations:  0.0494868220000626
    10k iterations:  0.4617418729999372
    100k iterations:  4.604098313999884
    200k iterations:  9.197777003000056
    300k iterations:  13.79744054799994

Interestingly, the regex module, which is supposed to be more performant (as far as I understood) and which is supposed to replace the standard library re some time in the future, was roughly two times slower in this test.

[1]: It's not a realistic test, as it just runs over the same string again and again, but I was in a hurry. Later I tried a slightly better test, with a string consisting of 200,000 and 300,000 lines (of the same string), and it came out roughly the same: ~8 seconds for 200,000 lines and ~12 seconds for 300,000.
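A rough sketch of what that line-based test could look like. This is not the exact script I ran: the helper `time_sub`, the constant names and the use of the question's sample as the repeated line are assumptions for illustration. It uses the newline-safe `[^\S\n]` version of the pattern, so the line breaks survive the substitution.

```python
import re
import timeit

# One input line (the sample from the question), repeated n_lines times.
LINE = """1,' unchanged 1' " unchanged  2 "   2.009,-2e15"""

# Newline-safe version of the pattern, compiled once up front.
RGEX = re.compile(r"""(?<=["'])[^\S\n]+(?=["'])|(?<=["'])[^\S\n]+(?=\d)|(?<=\d)[^\S\n]+(?=\d|\.\d)|(?<=(?<=\w|\d)\d)[^\S\n]+(?=["'])|(?<=["'\d])[^\S\n]*,[^\S\n]*""")


def time_sub(n_lines):
    """Time one substitution pass over n_lines copies of LINE.

    Returns a (seconds, result) tuple.
    """
    text = "\n".join([LINE] * n_lines)
    start = timeit.default_timer()
    result = RGEX.sub(",", text)
    return timeit.default_timer() - start, result


# Small run so the sketch finishes quickly; use 200_000 or 300_000
# to mirror the footnote.
seconds, result = time_sub(1000)
print(f"1000 lines: {seconds:.4f} s")
```

Timings will of course depend on the machine, so no numbers are given here.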

Upvotes: 1
