Reputation: 24788
I need to create a lexer/parser which deals with input data of variable length and structure.
Say I have a list of reserved keywords:
keyWordList = ['command1', 'command2', 'command3']
and a user input string:
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
userInputList = userInput.split()
How would I go about writing this function:
INPUT:
tokenize(userInputList, keyWordList)
OUTPUT:
[['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
I've written a tokenizer that can identify keywords, but I have been unable to figure out an efficient way to embed groups of non-keywords into lists that are one level deeper.
RE solutions are welcome, but I would really like to see the underlying algorithm as I am probably going to extend the application to lists of other objects and not just strings.
Upvotes: 2
Views: 948
Reputation: 397
Or have a look at PyParsing. It is quite a nice little lexer/parser combination.
Upvotes: 1
Reputation: 236004
Try this:
keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
inputList = userInput.split()
def tokenize(userInputList, keyWordList):
    keywords = set(keyWordList)
    tokens, acc = [], []
    for e in userInputList:
        if e in keywords:
            tokens.append(acc)
            tokens.append(e)
            acc = []
        else:
            acc.append(e)
    if acc:
        tokens.append(acc)
    return tokens
tokenize(inputList, keyWordList)
> [['The', 'quick', 'brown'], 'command1', ['fox', 'jumped', 'over'], 'command2', ['the', 'lazy', 'dog'], 'command3']
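For comparison, the same grouping can be sketched with `itertools.groupby` from the standard library, which splits the input into runs of keywords and runs of non-keywords. The name `tokenize_groupby` is made up for this illustration; it is an alternative to the loop above, not a replacement for it.

```python
from itertools import groupby

def tokenize_groupby(words, keywords):
    """Group runs of non-keywords into lists; keep keywords as single items."""
    keywords = set(keywords)  # set membership tests are O(1) on average
    tokens = []
    # groupby yields consecutive runs that share the same key value
    for is_keyword, run in groupby(words, key=lambda w: w in keywords):
        if is_keyword:
            tokens.extend(run)        # each keyword stays a top-level item
        else:
            tokens.append(list(run))  # non-keywords go one level deeper
    return tokens

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
result = tokenize_groupby(userInput.split(), keyWordList)
```

Unlike the loop above, this version never emits an empty list when two keywords are adjacent or when the input starts with a keyword.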
Upvotes: 1
Reputation: 33397
That is easy to do with some regex:
>>> import re
>>> reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(keyWordList)
>>> userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'
>>> re.findall(reg, userInput)
[('The quick brown', 'command1'), ('fox jumped over', 'command2'), ('the lazy dog', 'command3')]
Now you just have to split the first element of each tuple.
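A minimal sketch of that last step, assuming the same `keyWordList` and `userInput` as in the question. The `re.escape` call is an addition for safety in case a keyword contains regex metacharacters; it is not needed for these particular keywords.

```python
import re

keyWordList = ['command1', 'command2', 'command3']
userInput = 'The quick brown command1 fox jumped over command2 the lazy dog command3'

# Escape each keyword so characters like '.' or '+' are matched literally
reg = r'(.+?)\s(%s)(?:\s|$)' % '|'.join(re.escape(k) for k in keyWordList)

tokens = []
for words, keyword in re.findall(reg, userInput):
    tokens.append(words.split())  # non-keyword text becomes a nested list
    tokens.append(keyword)        # the keyword stays at the top level
```

Note that with this pattern any text after the last keyword is not captured, so it only fits input that ends with a keyword.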
For more than one level of nesting, regex may not be a good fit.
There are some nice parsers to choose from on this page: http://wiki.python.org/moin/LanguageParsing
I think Lepl is a good one.
Upvotes: 2
Reputation: 363587
Something like this:
def tokenize(lst, keywords):
    cur = []
    for x in lst:
        if x in keywords:
            yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    # emit any trailing non-keywords after the last keyword
    if cur:
        yield cur
This returns a generator, so wrap your call in list() if you need an actual list.
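Since the question mentions extending this to objects other than strings, here is one way the same loop might be generalized: pass a predicate instead of a keyword collection. The names `group_between` and `is_keyword` are hypothetical, chosen only for this sketch.

```python
def group_between(items, is_keyword):
    """Yield lists of ordinary items, with each keyword item yielded on its own."""
    cur = []
    for x in items:
        if is_keyword(x):
            if cur:          # skip empty groups between adjacent keywords
                yield cur
            yield x
            cur = []
        else:
            cur.append(x)
    if cur:                  # flush whatever follows the last keyword
        yield cur

# Strings, using set membership as the predicate
keywords = {'command1', 'command2'}
words = 'a b command1 c command2 d'.split()
result = list(group_between(words, keywords.__contains__))

# Arbitrary objects: here integers, with 0 acting as the separator
nums = list(group_between([1, 2, 0, 3, 0, 4], lambda n: n == 0))
```

Skipping empty groups is a design choice of this sketch; drop the `if cur:` check inside the loop if a strict list/keyword alternation is preferred.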
Upvotes: 5