twhale

Reputation: 755

Is it possible to exclude certain POS tags in spaCy? Python

I want to mark the position of verbs in sentences by adding an 'X' before the verb. My function takes the following steps to achieve this.

  1. Locate the verb. I use spaCy for POS tagging. SpaCy outputs a list of POS tags that I call pos, where each word in the sentence is represented as a tag.
  2. Also convert the sentence into a list L by splitting it on whitespace.
  3. Determine the index x of the verb tag (e.g. "VBZ") in the POS list.
  4. Insert the desired 'X' mark at index x into the sentence list.

Step 4 assumes that the list pos has the same length as the sentence list L. This is generally the case, except when spaCy assigns tags to sentence elements that str.split() does not separate. In that case the POS list is longer than the sentence list. For example, spaCy treats a bracket '(' or a full stop '.' attached to a word as a separate token, whereas splitting on whitespace does not. As a result, the 'X' is misplaced in the sentence.

How can I solve this?

Below is an example.

import pandas as pd
import spacy
nlp = spacy.load('en')

s = "Dr. John (a fictional chartacter) never shakes hands."
df = pd.DataFrame({'sentence':[s]})
k = df['sentence']

def marking(row):
    L = row
    sentence_spacy = nlp(L)
    pos = [] # store the pos tags in a list 'pos'
    for token in sentence_spacy:
        pos.append(token.tag_)
        print(pos)
    if "VBZ" in pos:
        x = pos.index("VBZ")
        L = L.split() # split the sentence into a list as well
        L.insert(x, "X")
        L = " ".join(L) # join the list back into a sentence
        print(L)
        return L
x = k.apply(marking)
print(x)    

This gives:

pos = ['NNP', 'NNP', '-LRB-', 'DT', 'JJ', 'NN', '-RRB-', 'RB', 'VBZ', 'NNS', '.']
L = ['Dr.', 'John', '(a', 'fictional', 'chartacter)', 'never', 'shakes', 'hands.']

And because the pos-list pos is longer than the sentence list L, the result is:

x = "Dr. John (a fictional chartacter) never shakes hands. X"

But I want this:

x = "Dr. John (a fictional chartacter) never X shakes hands."

My question is two-fold:

  1. Is it possible to exclude certain POS tags in spaCy? For example, can I exclude ['-LRB-', '-RRB-', etc.]? This would make len(pos) == len(L).

  2. If this is not possible, how should I change my function so that I can specify a list of POS tags (['-LRB-', '-RRB-', etc.]) to delete from pos, so that the POS list has the same length as the sentence list?

Upvotes: 2

Views: 1505

Answers (1)

emulbreh

Reputation: 3461

spaCy's tokenization is more complex than str.split(). Even dropping tokens will not make the output of split() line up with spaCy's tokens (try nlp('non-trivial')). Fortunately, there's a better way: you can reconstruct the sentence from the tokens and insert your mark at the desired point:

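def marking(row):
    chunks = []
    for token in nlp(row):
        if token.tag_ == 'VBZ':
            # the mark carries its own trailing space and goes before the verb
            chunks.append('X ')
        # text_with_ws keeps the token's original trailing whitespace
        chunks.append(token.text_with_ws)
    # no separator needed: the whitespace is already part of the chunks
    return ''.join(chunks)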

print(marking("Dr. John (a fictional chartacter) never shakes hands."))
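
A spaCy-free sketch of the same reconstruction idea may help to see how the pieces fit together. The (text_with_ws, tag) pairs below are hand-written stand-ins for what token.text_with_ws and token.tag_ would hold for this sentence, based on the tags listed in the question. The helper name mark_tokens and its parameters are made up for illustration. Because each chunk already carries its own trailing whitespace, the chunks are concatenated without a separator.

```python
def mark_tokens(tokens, target_tags=("VBZ",), mark="X"):
    """Rebuild a sentence from (text_with_ws, tag) pairs, putting `mark`
    in front of every token whose tag is in `target_tags`.

    Hypothetical helper: the pairs stand in for spaCy tokens.
    """
    chunks = []
    for text_with_ws, tag in tokens:
        if tag in target_tags:
            # The mark brings its own trailing space.
            chunks.append(mark + " ")
        # Each chunk already ends with the whitespace of the original text.
        chunks.append(text_with_ws)
    # Plain concatenation restores the original spacing.
    return "".join(chunks)


# Hand-written stand-in for the tokens of the example sentence
# (assumed values, matching the tag list shown in the question).
tokens = [
    ("Dr. ", "NNP"), ("John ", "NNP"), ("(", "-LRB-"), ("a ", "DT"),
    ("fictional ", "JJ"), ("chartacter", "NN"), (") ", "-RRB-"),
    ("never ", "RB"), ("shakes ", "VBZ"), ("hands", "NNS"), (".", "."),
]

marked = mark_tokens(tokens)
print(marked)
```

Since nothing is split on whitespace, brackets and full stops need no special handling, and passing more tags (for example "VBD" or "VBP") marks other verb forms as well.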

Upvotes: 2
