Reputation: 11
I have the following code that opens each file in a directory, runs spaCy NLP on it, and then writes the dependency parse info to a file in a new directory.
import os
import spacy

nlp = spacy.load('en')
path1 = 'C:/Path/to/my/input'
path2 = '../output'

for file in os.listdir(path1):
    # Join with path1 so the file is found regardless of the working directory
    with open(os.path.join(path1, file), encoding='utf-8') as text:
        txt = text.read()
    doc = nlp(txt)
    for sent in doc.sents:
        # Append this sentence's tokens to the output file of the same name
        f = open(path2 + '/' + file, 'a+')
        for token in sent:
            f.write(file + '\t' + str(token.dep_) + '\t' + str(token.head) + '\t' + str(token.right_edge) + '\n')
        f.close()
The trouble is that this won't preserve the order of the dependencies in the output file. I can't seem to find any references to character positions in the API documentation.
Upvotes: 0
Views: 112
Reputation: 4297
The character index is at token.idx. The word index is at token.i. I know this isn't particularly intuitive.
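As a rough sketch of how those two attributes could be put to use, the helper below collects each token's word index, character offset, and text. The names `position_rows` and `demo` are made up for illustration; the only spaCy attributes relied on are `token.i`, `token.idx`, and `token.text`. The demo assumes spaCy is installed and uses a blank English pipeline, which only tokenizes and so needs no trained model.

```python
def position_rows(tokens):
    """Return (word index, character offset, text) for each token.

    Works on any iterable of objects exposing .i, .idx and .text,
    which spaCy Token objects do.
    """
    return [(tok.i, tok.idx, tok.text) for tok in tokens]


def demo():
    # Assumption: spaCy is installed. spacy.blank("en") builds a
    # tokenizer-only pipeline, so no model download is required here.
    import spacy

    nlp = spacy.blank("en")
    doc = nlp("Hello world again")
    for word_index, char_offset, text in position_rows(doc):
        print(word_index, char_offset, text)
```

Writing `token.i` (or `token.idx`) as an extra column in the output file would let you sort or check the rows by position afterwards.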
Tokens also compare by position, so you could do:
for child in sent:
    word1, word2 = sorted((child, child.head))
This would get you each dependency arc, arranged in document order. I'm not sure what you're trying to do with the right edge there, though, so I'm not sure if this does quite what you want.
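To make that concrete, here is a small sketch that pairs each token with its head and puts the pair in document order. `ordered_arcs` is a hypothetical helper name, and it sorts with an explicit key on `token.i` rather than comparing tokens directly; within a single document that should give the same ordering as the snippet above.

```python
def ordered_arcs(tokens):
    """Yield (dependency label, earlier token, later token) per token.

    Each token is paired with its head and the pair is sorted by word
    index, so the token that occurs first in the document comes first.
    """
    for child in tokens:
        first, second = sorted((child, child.head), key=lambda tok: tok.i)
        # In spaCy the root token is its own head, so the root's
        # arc comes out as (root, root).
        yield child.dep_, first, second
```

With a parsed `doc`, you could call `ordered_arcs(sent)` for each `sent` in `doc.sents` and write one row per arc, for example the label followed by `first.i` and `second.i`.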
Upvotes: 1