Reputation: 13
I'm trying to do tokenization with spacy. I'm new to python and I want to know how to do tokenization to a csv file.
I have opened the file in a Jupyter notebook:
import csv
import wheel
with open('/Users/Desktop/Python Path copia/samsungs10.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    for riga in csv_reader:
        for campo in riga:
            print(campo, end=" ")
        print("")  # end of row
doc = nlp('csv_file')
And the output is correctly the csv dataset.
When I try to tokenize, I run into this issue:
#python3 -m spacy download en_core_web_sm
import spacy
import en_core_web_sm
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)
The output is:
csv_file csv_file ADP IN ROOT xxx_xxxx False False
WHY?
Upvotes: 0
Views: 2132
Reputation: 2126
Calling the nlp
object on a string of text returns a processed Doc. You passed the literal string 'csv_file', so spaCy tokenized those eight characters rather than your data. You need to change
doc = nlp ('csv_file')
so that it receives the actual text contents read from your csv file, e.g.
doc = nlp(csv_contents)
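To make the difference concrete, here is a minimal sketch. It uses a blank English pipeline (`spacy.blank("en")`) so it runs without downloading a model; swap in `spacy.load("en_core_web_sm")` if you also want POS tags and lemmas. The sample sentence is made up for illustration:

```python
import spacy

# Blank pipeline: tokenization only, no model download needed.
nlp = spacy.blank("en")

# Pass real text, not the name of a variable as a string.
doc = nlp("The Samsung S10 has a great camera")
print([token.text for token in doc])
# → ['The', 'Samsung', 'S10', 'has', 'a', 'great', 'camera']
```

If you had written `nlp('csv_file')`, the whole Doc would contain the single token `csv_file`, which is exactly the output you saw.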
Edit: In your example you have a collection of rows from a csv file. You can still use nlp to process the strings row by row. Here is one way to do it:
import csv
import spacy

nlp = spacy.load("en_core_web_lg")

docs = []
with open('file.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    for riga in csv_reader:
        for campo in riga:
            print(campo)
            docs.append(nlp(campo))

for item in docs:
    for token in item:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
Upvotes: 1