Reputation: 896
This was the question I got from an onsite interview with a tech firm, and one that I think ultimately killed my chances.
You're given a sentence, and a dictionary that has words as keys and parts of speech as values.
The goal is to write a function that, given a sentence, replaces each word with its part of speech from the dictionary, preserving order. We can assume that every token in the sentence is present as a key in the dictionary.
For instance, let's assume that we're given the following inputs:
sentence='I am done; Look at that, cat!'
dictionary={'!': 'sentinel', ',': 'sentinel',
'I': 'pronoun', 'am': 'verb',
'Look': 'verb', 'that': 'pronoun',
'at': 'preposition', ';': 'preposition',
'done': 'verb', ',': 'sentinel',
'cat': 'noun', '!': 'sentinel'}
output='pronoun verb verb sentinel verb preposition pronoun sentinel noun sentinel'
The tricky part was catching the sentinels (punctuation marks), which are attached to words rather than separated by spaces. Without them, this could be done easily by splitting on whitespace. Is there an easy way of doing it? Any library?
Upvotes: 3
Views: 693
Reputation: 627
If you are looking for an approach that doesn't use regular expressions, you can try this:
def tag_pos(sentence):
    output = []
    for word in sentence.split():
        if word not in dictionary:
            # separate the punctuation from the letters, e.g. "done;" -> "done" and ";"
            literal = ''.join([char for char in word if not char.isalpha()])
            word = ''.join([char for char in word if char.isalpha()])
            output.append(dictionary[word])
            # tag each punctuation character on its own
            for char in literal:
                output.append(dictionary[char])
        else:
            output.append(dictionary[word])
    return " ".join(output)

output = tag_pos(sentence)
print(output)
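One caveat with the approach above: it emits the letters first and the punctuation second, so it assumes punctuation always trails the word. Below is a minimal sketch of a regex-free variant that keeps tokens in their original order by using itertools.groupby to split each chunk into runs of letters and individual non-letter characters. The names tokenize and tag_pos_ordered are illustrative and not part of the answer above.

```python
from itertools import groupby

def tokenize(sentence):
    """Split a sentence into runs of letters and single non-letter characters."""
    tokens = []
    for chunk in sentence.split():
        # groupby yields consecutive runs of characters sharing the same isalpha() value
        for is_alpha, run in groupby(chunk, key=str.isalpha):
            run = ''.join(run)
            if is_alpha:
                tokens.append(run)   # a whole word, e.g. "done"
            else:
                tokens.extend(run)   # each punctuation character separately, e.g. ";"
    return tokens

def tag_pos_ordered(sentence, dictionary):
    # assumes every token is a key in dictionary, as the question states
    return ' '.join(dictionary[token] for token in tokenize(sentence))

sentence = 'I am done; Look at that, cat!'
dictionary = {
    '!': 'sentinel', ',': 'sentinel', ';': 'preposition',
    'I': 'pronoun', 'that': 'pronoun', 'cat': 'noun',
    'am': 'verb', 'Look': 'verb', 'done': 'verb', 'at': 'preposition',
}

print(tokenize(sentence))
print(tag_pos_ordered(sentence, dictionary))
```

Because the tokens come out in reading order, something like "(cat)" would be split into "(", "cat", ")" rather than having both brackets moved after the word.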
Upvotes: 2
Reputation: 138
Here's a less impressive but more explanatory solution:
Let's start by defining the example dictionary and sentence from your question:
sentence = 'I am done; Look at that, cat!'
dictionary = {
    '!': 'sentinel',
    ',': 'sentinel',
    'I': 'pronoun',
    'that': 'pronoun',
    'cat': 'noun',
    'am': 'verb',
    'Look': 'verb',
    'done': 'verb',
    'at': 'preposition',
    ';': 'preposition',
}
For my solution, I define a recursive parsing function, aptly named parse.
parse first splits a sentence into words by spaces, then attempts to classify each word by looking it up in the provided dictionary.
If the word can't be found in the dictionary (because there's some punctuation attached to it, etc.), parse then splits the word apart into its component tokens, and recursively parses it from there.
def parse(sentence, dictionary):
    # split the words apart by whitespace
    # some tokens may still be stuck together (e.g. "that,")
    words = sentence.split()
    # this is a list of strings containing the 'category' of each word
    output = []
    for word in words:
        if word in dictionary:
            # base case, the word is in the dictionary
            output.append(dictionary[word])
        else:
            # recursive case, the word still has tokens attached
            # get all the tokens in the word
            tokens = [key for key in dictionary if key in word]
            # try the shortest tokens first, so that splitting them off leaves the
            # longer words intact where possible (scat -> s, cat rather than sc, at)
            tokens.sort(key=len)
            # this is where we'll store the output
            sub_output = None
            # iterate through the tokens to find if there's a valid way to split the word
            for token in tokens:
                try:
                    # pad the token with spaces so the next split() separates it
                    sub_output = parse(
                        word.replace(token, f" {token} "),
                        dictionary
                    )
                    # if the word is parsable, no need to try other combinations
                    break
                except ValueError:
                    pass  # the word couldn't be split on this token
            # if no split worked, the word is not valid and the sentence can't be parsed
            if sub_output is None:
                raise ValueError(f"cannot parse {word!r}")
            output.append(sub_output)
    # put it all together into a neat little string
    return ' '.join(output)
Here's how you would use it:
# usage of parse
output = parse(sentence, dictionary)
# display the example output
print(output)
I hope my answer gave you some more insight into another method one might use to solve this problem.
Tada! 🎉
Upvotes: 2
Reputation: 1507
Python's regular expression module (re) can be used to split the sentence into tokens.
import re
sentence='I am done; Look at that, cat!'
dictionary={'!': 'sentinel', ',': 'sentinel',
'I': 'pronoun', 'am': 'verb',
'Look': 'verb', 'that': 'pronoun',
'at': 'preposition', ';': 'preposition',
'done': 'verb', ',': 'sentinel',
'cat': 'noun', '!': 'sentinel'}
tags = list()
for word in re.findall(r"[A-Za-z]+|\S", sentence):
    tags.append(dictionary[word])
print(' '.join(tags))
Output
pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel
The regular expression [A-Za-z]+|\S matches either a run of one or more letters (uppercase or lowercase), via [A-Za-z]+, or, where that fails, any single non-whitespace character, via \S. The | between the two parts means alternation.
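To see what the pattern does on its own, it can help to print the token list before doing the dictionary lookup. The sketch below does that; the second pattern is a variation that is not part of the answer above, shown only to illustrate how \w+ also keeps digits and underscores inside a word, and the sample string "room42!" is made up for the comparison.

```python
import re

sentence = 'I am done; Look at that, cat!'

# runs of ASCII letters, or else any single non-whitespace character
tokens = re.findall(r"[A-Za-z]+|\S", sentence)
print(tokens)

# the original pattern splits digits off one at a time
letters_only = re.findall(r"[A-Za-z]+|\S", "room42!")
print(letters_only)

# variation: \w+ keeps letters, digits and underscores together,
# and [^\w\s] picks up each remaining punctuation character
word_chars = re.findall(r"\w+|[^\w\s]", "room42!")
print(word_chars)
```

Which pattern fits depends on what the dictionary keys look like; for the example in the question, where every key is either a plain word or a single punctuation mark, the original pattern is enough.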
Upvotes: 6