Reputation: 383
I am using spaCy's nlp.pipe() to get Doc objects for the text data in a pandas DataFrame column, but the parsed text returned as "text" in the code has a length of only 32, even though the shape of the DataFrame is (14640, 16). Here is the data link if someone wants to read the data.
nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df['text']), batch_size = 1000, n_threads=-1):
    print(text)
len(text)
Result:
32
Can someone help me understand what is going on? What am I doing wrong?
Upvotes: 2
Views: 2995
Reputation: 12992
According to the spaCy documentation for the Doc object here, the __len__ operator gets "the number of tokens in the document".
The last text in your data is:
>>> df['text'].values[-1]
@AmericanAir we have 8 ppl so we need 2 know how many seats are on the next flight. Plz put us on standby for 4 people on the next flight?
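After running the nlp.pipe() method, this sentence is tokenized into 32 tokens, which is the number you are seeing: once the loop finishes, text still refers to the last Doc, so len(text) is that Doc's token count, not the number of documents. To verify that, try running the following code after len(text) and you will get the exact result: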
>>> last_tokens = [token for token in text]
>>> last_tokens
[@AmericanAir, we, have, 8, ppl, so, we, need, 2, know, how, many, seats, are, on, the, next, flight, ., Plz, put, us, on, standby, for, 4, people, on, the, next, flight, ?]
>>> len(last_tokens)
32
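The same effect can be reproduced without spaCy. The sketch below is only an illustration: fake_pipe is a hypothetical stand-in for nlp.pipe(), the sample texts are made up, and whitespace splitting is assumed in place of spaCy's tokenizer, so the counts will not match spaCy's.

```python
# Hypothetical stand-in for nlp.pipe(): yields one "doc" (a list of tokens)
# per input text. Whitespace splitting is an assumption for illustration only.
def fake_pipe(texts):
    for t in texts:
        yield t.split()

texts = ["we have 8 ppl", "Plz put us on standby", "next flight ?"]

lengths = []
for doc in fake_pipe(texts):
    lengths.append(len(doc))  # token count of the current doc

# After the loop, `doc` still refers to the last item only.
last_len = len(doc)      # token count of the last doc
n_docs = len(lengths)    # number of docs that were processed

print(lengths, last_len, n_docs)
```

Here last_len describes only the final text, while n_docs matches the number of inputs, which is the quantity the question expected.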
You can iterate over the tokens of each Doc returned from the pipeline like so:
nlp = spacy.load("en_core_web_sm")
for text in nlp.pipe(iter(df['text']), batch_size = 1000, n_threads=-1):
    for token in text:
        print(token)
    print('\n')
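If the goal is one Doc per row, it may be simpler to collect the Docs into a list and check its length. This is a minimal sketch, not the asker's exact setup: it assumes spacy.blank("en") (a tokenizer-only pipeline, so no model download is needed) and a short made-up list of texts in place of df['text']. The n_threads argument is left out because it has been deprecated since spaCy v2.1.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to count tokens.
nlp = spacy.blank("en")

# Made-up sample texts standing in for df['text'].
texts = [
    "we have 8 ppl so we need 2 know how many seats are on the next flight.",
    "Plz put us on standby for 4 people on the next flight?",
]

# nlp.pipe() is a generator; list() materializes one Doc per input text.
docs = list(nlp.pipe(texts, batch_size=1000))

# len(docs) is the number of documents; len(doc) is the token count of one Doc.
token_counts = [len(doc) for doc in docs]

print(len(docs))
print(token_counts)
```

With the DataFrame from the question, df['text'] could be passed in place of texts, and len(docs) should then equal the number of rows, assuming the column contains only strings.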
Upvotes: 1