Reputation: 1670
I am trying to write a regular expression that matches either strings surrounded by double quotes (") or words separated by spaces ( ), and to collect the matches in a list in Python.
I don't really understand the output of my code. Can anybody give me a hint or explain what my regular expression is doing exactly?
Here is my code:
import re
regex = re.compile('(\"[^\"]*\")|( [^ ]* )')
test = '"hello world." here are some words. "and more"'
print(regex.split(test))
I expect an output like this:
['"hello world."', ' here ', ' are ', ' some ', ' words. ', '"and more"']
but I get the following:
['', '"hello world."', None, '', None, ' here ', 'are', None, ' some ', 'words.', None, ' "and ', 'more"']
Where do the empty strings and the None values come from? And why does it match "hello world." but not "and more"?
Thanks for your help, and a happy new year for those who celebrate it today!
EDIT:
To be precise: I don't need the surrounding spaces, but I do need the surrounding quotes. This output would be fine too:
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
EDIT2:
I ended up using shlex.split() as @PadraicCunningham suggested, because it did exactly what I need and IMHO it is much more readable than regular expressions.
I am still keeping @TigerhawkT3's answer as the accepted one because it solves the problem in the way I asked (with regular expressions).
Upvotes: 2
Views: 658
Reputation: 180550
shlex.split with posix=False will do it for you:
import shlex
test = '"hello world." here are some words. "and more"'
print(shlex.split(test, posix=False))
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
If you do not want the quotes, leave posix at its default of True:
print(shlex.split(test))
['hello world.', 'here', 'are', 'some', 'words.', 'and more']
Upvotes: 2
Reputation: 49330
Include the quoted match first so it prioritizes that, and then non-whitespace characters:
>>> s = '"hello world." here are some words. "and more"'
>>> re.findall(r'"[^"]*"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
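To illustrate why the order of the alternatives matters, here is a small sketch comparing the pattern above with a reversed variant. The reversed pattern and the variable names are only for illustration. Because \S+ also matches the quote character, putting it first means the quoted alternative never gets a chance to match:

```python
import re

s = '"hello world." here are some words. "and more"'

# Quoted alternative first: a token that starts with a quote is consumed whole.
quoted_first = re.findall(r'"[^"]*"|\S+', s)

# \S+ first: it happily matches '"hello', so quoted strings are broken at spaces.
words_first = re.findall(r'\S+|"[^"]*"', s)

print(quoted_first)
print(words_first)
```

The second list should contain '"hello' and 'world."' as separate items, which is the behaviour the ordering avoids.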
You can get the same result with a non-greedy repeating pattern instead of the character set negation:
>>> re.findall(r'".*?"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
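As for the output in the question, the following sketch (using the question's pattern, rewritten as a raw string) lists what each match captures. re.split adds the text of every capturing group to the result, so each match contributes one entry per group, and the group that did not take part in the match shows up as None. The empty strings are the empty pieces of text before the first match and between two adjacent matches. The variable names here are only for illustration:

```python
import re

test = '"hello world." here are some words. "and more"'
pattern = re.compile(r'("[^"]*")|( [^ ]* )')

# Collect the span and both groups of every match, left to right.
matches = [(m.span(), m.groups()) for m in pattern.finditer(test)]

for span, groups in matches:
    # groups is (quoted_part, spaced_word); exactly one of them is None.
    print(span, groups)
```

The last match starts at the space before "and, so the second alternative consumes ' "and ' including the opening quote, and the remaining more" can no longer match as a quoted string. The second alternative also consumes the trailing space of each word, which is why 'are' and 'words.' are left over as plain split pieces.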
Upvotes: 3
Reputation: 213125
Looks like CSV, so use the appropriate tools:
import csv
lines = ['"hello world." here are some words. "and more"']
list(csv.reader(lines, delimiter=' ', quotechar='"'))
returns
[['hello world.', 'here', 'are', 'some', 'words.', 'and more']]
Upvotes: 0