Reputation: 21
I am new to spaCy and NLTK as a whole, so I apologize in advance if this seems like a dumb question.
Based on the spaCy tutorial, I have to use the following command to load text into a doc:
doc = nlp(u'Hello, world. Natural Language Processing in 10 lines of code.')
However, I have a lot of text stored in tabular format on SQL Server or in Excel. It basically has two columns: the first column holds a unique identifier, and the second a short text.
How do I load them into spaCy? Do I need to convert them into a NumPy array or pandas DataFrame first and then load that into the doc?
Thanks in advance for your help!
Upvotes: 2
Views: 7930
Reputation: 8954
I think alexis's comment to use pandas .apply() is the best answer; this worked great for me (note that pandas must be imported and nlp defined before the snippet runs):
import pandas as pd
import spacy

nlp = spacy.load('en')
df = pd.read_csv('doc filename.txt')
df['text_as_spacy_objects'] = df['text column name'].apply(nlp)
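Since the question mentions SQL Server and Excel rather than a CSV file, here is a sketch of the same .apply() pattern fed from those sources. The file name, connection string, table and column names are all placeholders, and spacy.blank('en') stands in for a full model so the example runs without a model download:

```python
import pandas as pd
import spacy

nlp = spacy.blank('en')  # placeholder pipeline; use spacy.load(...) for real tagging

# From Excel (pandas needs an engine such as openpyxl installed):
# df = pd.read_excel('texts.xlsx')

# From SQL Server via SQLAlchemy (connection string is a placeholder):
# import sqlalchemy
# engine = sqlalchemy.create_engine('mssql+pyodbc://user:pass@dsn')
# df = pd.read_sql('SELECT id, text FROM my_table', engine)

# Simulated here with an in-memory frame shaped like the question's table:
df = pd.DataFrame({'id': [1, 2],
                   'text': ['Hello, world.', 'Natural Language Processing.']})

df['parsed'] = df['text'].apply(nlp)
print(type(df['parsed'].iloc[0]))  # each row now holds a spaCy Doc object
```

Whichever reader you use, the result is the same: a DataFrame column of Doc objects, one per row, still aligned with the identifier column.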
Upvotes: 1
Reputation: 122092
Given a tab-separated file like this:
$ cat test.tsv
DocID Text WhateverAnnotations
1 Foo bar bar dot dot dot
2 bar bar black sheep dot dot dot dot
$ cut -f2 test.tsv
Text
Foo bar bar
bar bar black sheep
And in code:
$ python
>>> import pandas as pd
>>> pd.read_csv('test.tsv', delimiter='\t')
DocID Text WhateverAnnotations
0 1 Foo bar bar dot dot dot
1 2 bar bar black sheep dot dot dot dot
>>> df = pd.read_csv('test.tsv', delimiter='\t')
>>> df['Text']
0 Foo bar bar
1 bar bar black sheep
Name: Text, dtype: object
To use nlp.pipe() in spaCy:
>>> import spacy
>>> nlp = spacy.load('en')
>>> for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1, n_threads=4):
...     print(parsed_doc[0].text, parsed_doc[0].tag_)
...
Foo NNP
bar NN
To use pandas.DataFrame.apply():
>>> df['Parsed'] = df['Text'].apply(nlp)
>>> df['Parsed'].iloc[0]
Foo bar bar
>>> type(df['Parsed'].iloc[0])
<class 'spacy.tokens.doc.Doc'>
>>> df['Parsed'].iloc[0][0].tag_
'NNP'
>>> df['Parsed'].iloc[0][0].text
'Foo'
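Since the question's first column is a unique identifier, it is worth keeping it paired with each Doc. A sketch of one way to do that with nlp.pipe(): the as_tuples=True argument (available in newer spaCy versions) threads arbitrary context through the pipe, and spacy.blank('en') is used here only so the example runs without a model download:

```python
import pandas as pd
import spacy

nlp = spacy.blank('en')  # placeholder pipeline; swap in a loaded model for tags

df = pd.DataFrame({'DocID': [1, 2],
                   'Text': ['Foo bar bar', 'bar bar black sheep']})

# as_tuples=True passes each row's DocID through untouched, so every
# Doc comes back paired with the identifier it belongs to:
results = {doc_id: doc
           for doc, doc_id in nlp.pipe(zip(df['Text'], df['DocID']),
                                       as_tuples=True)}
print(results[1][0].text)  # first token of document 1
```

This avoids relying on output order alone to match parsed documents back to their rows.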
To benchmark.
First duplicate the rows 2 million times:
$ cat test.tsv
DocID Text WhateverAnnotations
1 Foo bar bar dot dot dot
2 bar bar black sheep dot dot dot dot
$ tail -n 2 test.tsv > rows2
$ perl -ne 'print "$_" x1000000' rows2 > rows2000000
$ cat test.tsv rows2000000 > test-2M.tsv
$ wc -l test-2M.tsv
2000003 test-2M.tsv
$ head test-2M.tsv
DocID Text WhateverAnnotations
1 Foo bar bar dot dot dot
2 bar bar black sheep dot dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
1 Foo bar bar dot dot dot
[nlppipe.py]:
import time
import pandas as pd
import spacy

df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
for parsed_doc in nlp.pipe(iter(df['Text']), batch_size=1000, n_threads=4):
    x = parsed_doc[0].tag_
print(time.time() - start)
[dfapply.py]:
import time
import pandas as pd
import spacy

df = pd.read_csv('test-2M.tsv', delimiter='\t')
nlp = spacy.load('en')

start = time.time()
df['Parsed'] = df['Text'].apply(nlp)
for doc in df['Parsed']:
    x = doc[0].tag_
print(time.time() - start)
Upvotes: 7
Reputation: 7105
This should be pretty simple – you can use any method you want to read your texts from your database (Pandas dataframe, a CSV reader etc.) and then iterate over them.
It ultimately depends on what you want to do and how you want to process your text – if you want to process each text individually, simply iterate over your data line by line:
for id, line in text:
    doc = nlp(line)
    # do something with each text
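A self-contained version of that loop, assuming the question's two-column layout; io.StringIO stands in for a real file handle or database cursor, and spacy.blank('en') for a full model:

```python
import csv
import io
import spacy

nlp = spacy.blank('en')  # placeholder pipeline

# io.StringIO simulates a two-column source where each row is
# (identifier, text), as described in the question:
rows = io.StringIO('1,Foo bar bar\n2,bar bar black sheep\n')

docs = {}
for doc_id, line in csv.reader(rows):
    docs[doc_id] = nlp(line)  # one Doc per row, keyed by its identifier
```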
Alternatively, you can also join the texts into one string and process them as one document:
text = open('some_large_text_file.txt').read()
doc = nlp(text)
For a more advanced usage example, see this code snippet of streaming input and output using pipe().
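A minimal sketch of that streaming pattern: pipe() accepts any iterable, including a generator, so nothing needs to be held in memory up front, and Doc objects are yielded in input order as they are processed. spacy.blank('en') is used so the sketch runs without a model download:

```python
import spacy

nlp = spacy.blank('en')  # placeholder pipeline; a loaded model works the same way

texts = ('short text number %d' % i for i in range(100))  # any lazy stream

n_tokens = 0
for doc in nlp.pipe(texts, batch_size=32):
    n_tokens += len(doc)  # Docs arrive one at a time, in input order
print(n_tokens)
```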
Upvotes: 0