linluk

Reputation: 1670

Matching quoted strings and unquoted words

I am trying to write a regular expression that matches either strings surrounded by double quotes (") or words separated by spaces ( ), and to collect the matches in a list in Python.

I don't really understand the output of my code. Can anybody give me a hint or explain what my regular expression is actually doing?

Here is my code:

import re
regex = re.compile('(\"[^\"]*\")|( [^ ]* )')
test = '"hello world." here are some words. "and more"'
print(regex.split(test))

I expect an output like this:

['"hello world."', ' here ', ' are ', ' some ', ' words. ', '"and more"']

But I get the following:

['', '"hello world."', None, '', None, ' here ', 'are', None, ' some ', 'words.', None, ' "and ', 'more"']

Where do the empty strings and the Nones come from? And why does it match "hello world." but not "and more"?

Thanks for your help, and a happy new year for those who celebrate it today!

EDIT:
To be precise: I don't need the surrounding spaces, but I do need the surrounding quotes. This output would be fine too:

['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']

EDIT2:

I ended up using shlex.split(), as @PadraicCunningham suggested, because it does exactly what I need and IMHO it is much more readable than regular expressions.

I am still keeping @TigerhawkT3's answer as the accepted one because it solves the problem the way I asked (with regular expressions).

Upvotes: 2

Views: 658

Answers (3)

Padraic Cunningham

Reputation: 180550

shlex.split with posix=False will do it for you:

import shlex

test = '"hello world." here are some words. "and more"'

print(shlex.split(test,posix=False))
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']

If you do not want the quotes, leave posix at its default of True:

print(shlex.split(test))

['hello world.', 'here', 'are', 'some', 'words.', 'and more']
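One thing that may be worth knowing before switching to shlex: in the default POSIX mode it treats an unclosed quote (including a lone apostrophe, as in "don't") as an error rather than as ordinary text. The sketch below shows that behaviour; the helper name tokenize is made up for illustration and is not part of shlex.

```python
import shlex


def tokenize(text):
    """Hypothetical helper: split with shlex, or return None on an unclosed quote."""
    try:
        return shlex.split(text)
    except ValueError:
        # shlex raises ValueError when a quotation is never closed
        return None


ok = tokenize('"hello world." here')
broken = tokenize('"hello world. here')   # the double quote is never closed
apostrophe = tokenize("don't stop")       # the apostrophe opens a single-quoted string

print(ok)
print(broken)
print(apostrophe)
```

Whether returning None is the right reaction depends on your input; falling back to a regular expression for such lines is another option.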

Upvotes: 2

TigerhawkT3

Reputation: 49330

Put the quoted alternative first so that it takes priority, then fall back to runs of non-whitespace characters:

>>> import re
>>> s = '"hello world." here are some words. "and more"'
>>> re.findall(r'"[^"]*"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
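To illustrate why this uses findall rather than split, and why the order of the alternatives matters, here is a small sketch. The tiny pattern (a)|(b) and the variable names are only for illustration; the point is that re.split also returns every capturing group for each match, with None for a group that did not take part, which is where the Nones in the question come from.

```python
import re

s = '"hello world." here are some words. "and more"'

# re.split returns the text between matches plus all capturing groups;
# a group that did not participate in a match shows up as None.
split_result = re.split(r'(a)|(b)', 'xaybz')

# Quoted alternative first: at a double quote, the quoted branch is tried first.
quoted_first = re.findall(r'"[^"]*"|\S+', s)

# Alternatives swapped: \S+ succeeds first everywhere, so quoted strings
# are broken apart at the spaces inside them.
words_first = re.findall(r'\S+|"[^"]*"', s)

print(split_result)
print(quoted_first)
print(words_first)
```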

You can get the same result here with a non-greedy (lazy) repetition instead of the negated character class:

>>> re.findall(r'".*?"|\S+', s)
['"hello world."', 'here', 'are', 'some', 'words.', '"and more"']
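The two patterns agree on this input, but they are not equivalent in general. As a sketch, assuming the text may contain a quoted string that spans a line break (the sample string below is made up): . does not match a newline unless re.DOTALL is set, while [^"] does.

```python
import re

# Hypothetical input with a newline inside the quotes
multiline = '"multi\nline" word'

# The negated class matches any character except a double quote, newline included.
with_class = re.findall(r'"[^"]*"|\S+', multiline)

# Without re.DOTALL, . stops at the newline, so the quoted branch fails
# and \S+ picks up the fragments instead.
with_lazy_dot = re.findall(r'".*?"|\S+', multiline)

# With re.DOTALL the lazy version behaves like the negated class again.
with_dotall = re.findall(r'".*?"|\S+', multiline, flags=re.DOTALL)

print(with_class)
print(with_lazy_dot)
print(with_dotall)
```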

Upvotes: 3

eumiro

Reputation: 213125

Looks like CSV, so use the appropriate tools:

import csv
lines = ['"hello world." here are some words. "and more"']

list(csv.reader(lines, delimiter=' ', quotechar='"'))

returns

[['hello world.', 'here', 'are', 'some', 'words.', 'and more']]

Upvotes: 0
