Reputation: 755
I want to mark the position of verbs in sentences by adding an 'X' before the verb. My function takes the following steps to achieve this:

1. Create a POS list pos, where each word in the sentence is represented as a tag.
2. Split the sentence into a list L.
3. Find the index x of the verb tag (e.g. "VBZ") in the POS list.
4. Insert "X" at index x into the sentence list.

Step 4 assumes that the length of the list pos is identical to the length of the sentence list L. This is generally the case, except when spaCy assigns tags to sentence elements that Python does not index separately. In that case the POS list is longer than the sentence list. For example, spaCy sees a bracket '(' or a full stop behind a word '.' as a separate position, whereas Python's split() does not. As a result, the 'X' is misplaced in the sentence.
How to solve this?
Below is an example.
import pandas as pd
import spacy
nlp = spacy.load('en')
s = "Dr. John (a fictional chartacter) never shakes hands."
df = pd.DataFrame({'sentence':[s]})
k = df['sentence']
def marking(row):
    L = row
    sentence_spacy = nlp(L)
    pos = []                 # store the POS tags in a list 'pos'
    for token in sentence_spacy:
        pos.append(token.tag_)
    print(pos)
    if "VBZ" in pos:
        x = pos.index("VBZ")
        L = L.split()        # split the sentence also in a list
        L.insert(x, "X")
        L = " ".join(L)
    print(L)
    return L
x = k.apply(marking)
print(x)
This gives:
pos = ['NNP', 'NNP', '-LRB-', 'DT', 'JJ', 'NN', '-RRB-', 'RB', 'VBZ', 'NNS', '.']
L = ['Dr.', 'John', '(a', 'fictional', 'chartacter)', 'never', 'shakes', 'hands.']
And because the POS list pos is longer than the sentence list L, the result is:
x = "Dr. John (a fictional chartacter) never shakes hands. X"
But I want this:
x = "Dr. John (a fictional chartacter) never X shakes hands."
My question is two-fold:

1. Is it possible to exclude certain POS tags in spaCy? For example, can I exclude ['-LRB-', '-RRB-', etc.]? This would make length pos == length L.
2. If this is not possible, how should I change my function so that a list of POS tags ['-LRB-', '-RRB-', etc.] can be specified that are deleted from pos, so that the length of the POS list is identical to the length of the sentence list?
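For reference, the tag-filtering idea in question 2 could be sketched as below. This is my own illustration, the drop set is an assumed choice of tags to exclude, and it only happens to work when every dropped token was glued to a neighbouring word by split():

```python
# pos list as printed by the function for the example sentence
pos = ['NNP', 'NNP', '-LRB-', 'DT', 'JJ', 'NN', '-RRB-', 'RB', 'VBZ', 'NNS', '.']

drop = {'-LRB-', '-RRB-', '.'}   # assumed set of tags to delete from pos

pos_filtered = [tag for tag in pos if tag not in drop]
print(len(pos_filtered))         # 8 -- now matches the 8-item split() list
print(pos_filtered.index('VBZ')) # 6 -- 'X' would land right before 'shakes'
```

This alignment is fragile: it breaks as soon as spaCy splits a token that split() does not (e.g. hyphenated words), which is exactly the point the answer makes.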
Upvotes: 2
Views: 1505
Reputation: 3461
Tokenization is more complex than split(). Even dropping tokens will not make split() correspond to spaCy's tokens (try nlp('non-trivial')). Fortunately there's a better way: you can reconstruct the sentence from the tokens and insert your mark at the desired point:
def marking(row):
    chunks = []
    for token in nlp(row):
        if token.tag_ == 'VBZ':
            chunks.append('X ')           # mark goes right before the verb
        chunks.append(token.text_with_ws) # token text plus its original whitespace
    return ''.join(chunks)
print(marking("Dr. John (a fictional chartacter) never shakes hands."))
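The nlp('non-trivial') point above is easy to check without downloading a model, since only the tokenizer is needed; a sketch assuming only that the spacy package itself is installed:

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only, no statistical model required

print([t.text for t in nlp('non-trivial')])  # spaCy splits at the hyphen: 3 tokens
print('non-trivial'.split())                 # ['non-trivial'] -- a single item
```

No set of tags to drop can repair a mismatch like this, which is why rebuilding the sentence from token.text_with_ws is the more robust approach.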
Upvotes: 2