Reputation: 896

Split sentence into words and non-white characters for POS Tagging

This was the question I got from an onsite interview with a tech firm, and one that I think ultimately killed my chances.

You're given a sentence, and a dictionary that has words as keys and parts of speech as values.

The goal is to write a function that, given a sentence, replaces each word with its part of speech from the dictionary, preserving order. We can assume that every token in the sentence is present as a key in the dictionary.

For instance, let's assume that we're given the following inputs:

sentence='I am done; Look at that, cat!' 

dictionary={'!': 'sentinel', ',': 'sentinel', 
            'I': 'pronoun', 'am': 'verb', 
            'Look': 'verb', 'that': 'pronoun', 
            'at': 'preposition', ';': 'preposition', 
            'done': 'verb', ',': 'sentinel', 
            'cat': 'noun', '!': 'sentinel'}

output='pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel'

The tricky part was catching the sentinels. If the sentence had no sentinels, this would be easy. Is there an easy way of doing it? Is there a library for this?

Upvotes: 3

Answers (3)

Faizan Naseer

Reputation: 627

If you are looking for an approach that does not use regular expressions, you can try this:

def tag_pos(sentence):
    output = []
    for word in sentence.split():
        if word not in dictionary:
            # split the chunk into its letters and its punctuation, e.g. 'done;' -> 'done' and ';'
            literal = ''.join([char for char in word if not char.isalpha()])
            word = ''.join([char for char in word if char.isalpha()])
            if word:
                output.append(dictionary[word])
            # tag every punctuation character separately
            for char in literal:
                output.append(dictionary[char])
        else:
            output.append(dictionary[word])

    return " ".join(output)


output = tag_pos(sentence)
print(output)
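
The loop above emits the word before its punctuation, so it assumes punctuation trails the letters in each whitespace-separated chunk. As a rough sketch that does not rely on that assumption, the sentence can be scanned one character at a time. The names `tokenize` and `tag_tokens` are made up for this illustration, and the sample dictionary is copied from the question with the duplicate keys dropped:

```python
def tokenize(sentence):
    """Split a sentence into runs of letters and single non-space characters."""
    tokens = []
    current = []  # letters of the word currently being built
    for char in sentence:
        if char.isalpha():
            current.append(char)
            continue
        # any non-letter ends the current word
        if current:
            tokens.append(''.join(current))
            current = []
        # whitespace is dropped; every other character becomes its own token
        if not char.isspace():
            tokens.append(char)
    # flush a word that runs to the end of the sentence
    if current:
        tokens.append(''.join(current))
    return tokens


def tag_tokens(sentence, dictionary):
    """Look up each token's part of speech, preserving order."""
    return ' '.join(dictionary[token] for token in tokenize(sentence))


# sample data from the question
dictionary = {'!': 'sentinel', ',': 'sentinel', ';': 'preposition',
              'I': 'pronoun', 'am': 'verb', 'Look': 'verb',
              'that': 'pronoun', 'at': 'preposition',
              'done': 'verb', 'cat': 'noun'}

print(tokenize('I am done; Look at that, cat!'))
print(tag_tokens('I am done; Look at that, cat!', dictionary))
```

With this version, punctuation that comes before a word (for example `'!cat'`) is tagged in the order it appears, at the cost of a slightly longer function.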

Upvotes: 2

i..

Reputation: 138

Here's a less impressive but more explanatory solution:

Let's start by defining the example dictionary and sentence from your question:

sentence = 'I am done; Look at that, cat!' 

dictionary = {
    '!':    'sentinel', 
    ',':    'sentinel', 
    'I':    'pronoun', 
    'that': 'pronoun', 
    'cat':  'noun', 
    'am':   'verb', 
    'Look': 'verb', 
    'done': 'verb', 
    'at':   'preposition', 
    ';':    'preposition', 
}

For my solution, I define a recursive parsing function, aptly named parse. parse first splits a sentence into words by spaces, then attempts to classify each word by looking it up in the provided dictionary. If the word can't be found in the dictionary (because there's some punctuation attached to it, etc.), parse then splits the word apart into its component tokens, and recursively parses it from there.

def parse(sentence, dictionary):
  # split the words apart by whitespace
  # some tokens may still be stuck together. (i.e. "that,")
  words = sentence.split() 

  # this is a list of strings containing the 'category' of each word
  output = [] 

  for word in words:
    if word in dictionary:
      # base case, the word is in the dictionary
      output.append(dictionary[word])
    else:
      # recursive case, the word still has tokens attached

      # get all the tokens in the word
      tokens = [key for key in dictionary.keys() if key in word]

      # sort the tokens longest first - this makes sure big words are more likely to be preserved (e.g. try "cat" before "at" when splitting "scat")
      tokens.sort(key=len, reverse=True)

      # this is where we'll store the output 
      sub_output = None

      # iterate through the tokens to find if there's a valid way to split the word
      for token in tokens:
        try: 
          # pad the tokens inside each word
          sub_output = parse(
            word.replace(token, f" {token} "), 
            dictionary
          )

          # if the word is parsable, no need to try other combinations
          break
        except ValueError: 
          pass # the word couldn't be split on this token

      # make sure that the word was split - if it wasn't it's not a valid word and the sentence can't be parsed
      if sub_output is None:
        raise ValueError(f"cannot parse {word!r}")

      output.append(sub_output)

  # put it all together into a neat little string
  return ' '.join(output)

Here's how you would use it:

# usage of parse
output = parse(sentence, dictionary)

# display the example output
print(output)

I hope my answer gave you some more insight into another method one might use to solve this problem.

Tada! 🎉

Upvotes: 2

Divyanshu Srivastava

Reputation: 1507

Python's regular expression module (re) can be used to split the sentence into tokens.

import re
sentence='I am done; Look at that, cat!' 

dictionary={'!': 'sentinel', ',': 'sentinel', 
            'I': 'pronoun', 'am': 'verb', 
            'Look': 'verb', 'that': 'pronoun', 
             'at': 'preposition', ';': 'preposition', 
             'done': 'verb', ',': 'sentinel', 
             'cat': 'noun', '!': 'sentinel'}

tags = list()
for word in re.findall(r"[A-Za-z]+|\S", sentence):
    tags.append(dictionary[word])

print (' '.join(tags))

Output

pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel

The regular expression [A-Za-z]+|\S matches either a run of one or more letters, uppercase or lowercase (the [A-Za-z]+ part), or, through the alternation operator |, any single non-whitespace character (the \S part). Because the letter run is tried first, words stay whole and each punctuation mark becomes its own token.
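
To see what the pattern does on its own, here is a small sketch that prints the tokens before tagging them. The helper name `pos_tags` and the `sample` dictionary are only for illustration, and raising a `ValueError` for unknown tokens is an assumption added here, since the question guarantees that every token is in the dictionary:

```python
import re

# one or more ASCII letters, or else any single non-whitespace character
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\S")


def pos_tags(sentence, dictionary):
    """Return the list of tags for the tokens of `sentence`, in order."""
    tokens = TOKEN_PATTERN.findall(sentence)
    # assumption: fail loudly instead of with a bare KeyError
    missing = [token for token in tokens if token not in dictionary]
    if missing:
        raise ValueError(f"tokens missing from dictionary: {missing}")
    return [dictionary[token] for token in tokens]


# a cut-down version of the dictionary from the question
sample = {'that': 'pronoun', ',': 'sentinel', 'cat': 'noun', '!': 'sentinel'}

print(TOKEN_PATTERN.findall('that, cat!'))  # ['that', ',', 'cat', '!']
print(pos_tags('that, cat!', sample))       # ['pronoun', 'sentinel', 'noun', 'sentinel']
```

Note that [A-Za-z] only covers ASCII letters, so a word such as 'café' comes out as the two tokens 'caf' and 'é'. If the input can contain such words, the character class would need to be widened.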

Upvotes: 6
