Reputation: 107
I'm trying to parse user input in which each word/name/number is separated by whitespace (except for strings, which are delimited by double quotes) and pushed into a list. The list gets printed along the way. I previously made a version of this code, but this time I want to use tokens to make things cleaner. Here's what I have so far, but it's not printing anything.
#!/util/bin/python
import re

def main():
    for i in tokenizer('abcd xvc 23432 "exampe" 366'):
        print(i)

tokens = (
    ('STRING', re.compile('"[^"]+"')),  # longest match
    ('NAME', re.compile('[a-zA-Z_]+')),
    ('SPACE', re.compile('\s+')),
    ('NUMBER', re.compile('\d+')),
)

def tokenizer(s):
    i = 0
    lexeme = []
    while i < len(s):
        match = False
        for token, regex in tokens:
            result = regex.match(s, i)
            if result:
                lexeme.append((token, result.group(0)))
                i = result.end()
                match = True
                break
        if not match:
            raise Exception('lexical error at {0}'.format(i))
    return lexeme

main()
Upvotes: 0
Views: 1702
Reputation: 40733
I suggest using the shlex module to break up the quoted strings:
>>> import shlex
>>> s = 'hello "quoted string" 123 \'More quoted string\' end'
>>> s
'hello "quoted string" 123 \'More quoted string\' end'
>>> shlex.split(s)
['hello', 'quoted string', '123', 'More quoted string', 'end']
After that, you can classify your tokens (string, number, ...) however you want. The only token type you lose is SPACE: shlex discards whitespace.
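For example, classification after shlex.split could look like the sketch below. The classify helper is my own, with category names mirroring the question's token tuples; it assumes posix=False so the quote marks survive and quoted strings stay distinguishable:

```python
import re
import shlex

def classify(token):
    """Map one shlex token to a (category, text) pair (hypothetical helper)."""
    if re.fullmatch(r'\d+', token):
        return ('NUMBER', token)
    if re.fullmatch(r'[a-zA-Z_]+', token):
        return ('NAME', token)
    return ('STRING', token)  # anything else, e.g. a quoted phrase

line = 'abcd xvc 23432 "exampe" 366'
# posix=False keeps the quote marks on "exampe"
print([classify(t) for t in shlex.split(line, posix=False)])
# → [('NAME', 'abcd'), ('NAME', 'xvc'), ('NUMBER', '23432'),
#    ('STRING', '"exampe"'), ('NUMBER', '366')]
```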
Here is a simple demo:
import shlex

if __name__ == '__main__':
    line = 'abcd xvc 23432 "exampe" 366'
    tokens = shlex.split(line)
    for token in tokens:
        print('>{}<'.format(token))
Output:
>abcd<
>xvc<
>23432<
>exampe<
>366<
If you insist on keeping the quote marks, call split() with posix=False:

    tokens = shlex.split(line, posix=False)
Output:
>abcd<
>xvc<
>23432<
>"exampe"<
>366<
Upvotes: 2
Reputation: 59604
I think your indentation is broken. This version:
#!/util/bin/python
import re

tokens = (
    ('STRING', re.compile('"[^"]+"')),  # longest match
    ('NAME', re.compile('[a-zA-Z_]+')),
    ('SPACE', re.compile('\s+')),
    ('NUMBER', re.compile('\d+')),
)

def main():
    for i in tokenizer('abcd xvc 23432 "exampe" 366'):
        print(i)

def tokenizer(s):
    i = 0
    lexeme = []
    while i < len(s):
        match = False
        for token, regex in tokens:
            result = regex.match(s, i)
            if result:
                lexeme.append((token, result.group(0)))
                i = result.end()
                match = True
                break
        if not match:
            raise Exception('lexical error at {0}'.format(i))
    return lexeme

main()
prints:
('NAME', 'abcd')
('SPACE', ' ')
('NAME', 'xvc')
('SPACE', ' ')
('NUMBER', '23432')
('SPACE', ' ')
('STRING', '"exampe"')
('SPACE', ' ')
('NUMBER', '366')
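If the SPACE entries are just noise, the same tokenizer can drop them while scanning. A minimal sketch of that variant (only the SPACE filter and the for/else error handling differ from the code above):

```python
import re

tokens = (
    ('STRING', re.compile(r'"[^"]+"')),  # longest match
    ('NAME', re.compile(r'[a-zA-Z_]+')),
    ('SPACE', re.compile(r'\s+')),
    ('NUMBER', re.compile(r'\d+')),
)

def tokenizer(s):
    i = 0
    lexeme = []
    while i < len(s):
        for token, regex in tokens:
            result = regex.match(s, i)
            if result:
                if token != 'SPACE':  # skip whitespace tokens
                    lexeme.append((token, result.group(0)))
                i = result.end()
                break
        else:  # no pattern matched at position i
            raise ValueError('lexical error at {0}'.format(i))
    return lexeme

print(tokenizer('abcd xvc 23432 "exampe" 366'))
# → [('NAME', 'abcd'), ('NAME', 'xvc'), ('NUMBER', '23432'),
#    ('STRING', '"exampe"'), ('NUMBER', '366')]
```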
Upvotes: 1