Reputation: 151
This is my dataset:
emma=gutenberg.sents('austen-emma.txt')
It gives me the sentences as lists of word tokens:
[[u'she', u'was', u'happy'], [u'It', u'was', u'her', u'own', u'good']]
But this is what I want to get:
['she was happy','It was her own good']
Upvotes: 3
Views: 4774
Reputation: 2904
As noted by alvas and AShelly, what you see is the correct behavior. However, their approaches of just joining the words of each sentence have two drawbacks:
- The joined strings contain spurious whitespace around punctuation (e.g. "Emma Woodhouse , handsome , clever , and rich , with a comfortable [...]").
- PlaintextCorpusReader has to perform word tokenization just for reverting it afterwards, which is avoidable computational overhead.
Given the implementation of PlaintextCorpusReader, it is easy to derive a function that takes exactly the same steps as PlaintextCorpusReader.sents(), but without the word tokenization:
def sentences_from_corpus(corpus, fileids=None):
    from nltk.corpus.reader.plaintext import read_blankline_block, concat

    def read_sent_block(stream):
        sents = []
        # Split each paragraph block into sentences, but skip the word
        # tokenization that PlaintextCorpusReader.sents() would perform.
        for para in corpus._para_block_reader(stream):
            sents.extend([s.replace('\n', ' ')
                          for s in corpus._sent_tokenizer.tokenize(para)])
        return sents

    # Build one lazy corpus view per file and concatenate them, just like
    # PlaintextCorpusReader.sents() does.
    return concat([corpus.CorpusView(path, read_sent_block, encoding=enc)
                   for (path, enc, fileid)
                   in corpus.abspaths(fileids, True, True)])
In contrast to what I said above, there is one additional step performed by this function: since we are no longer doing word tokenization, we have to replace newlines with whitespace.
Passing the gutenberg corpus to this function results in:
['[Emma by Jane Austen 1816]',
'VOLUME I',
'CHAPTER I',
'Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.',
"She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.",
...]
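A quick usage sketch (assuming NLTK and its gutenberg data are installed, that the function above has already been defined, and noting that restricting fileids to 'austen-emma.txt' is only for illustration):
from nltk.corpus import gutenberg

# Whole corpus: yields the listing shown above.
all_sentences = sentences_from_corpus(gutenberg)
print(all_sentences[:3])

# A single file, matching the question's input.
emma_sentences = sentences_from_corpus(gutenberg, 'austen-emma.txt')
print(emma_sentences[3])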
Upvotes: 3
Reputation: 122112
The corpora accessed using the nltk.corpus API normally return a document stream, i.e. a list of sentences, where each sentence is a list of tokens.
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> emma[0]
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
>>> emma[1]
[u'VOLUME', u'I']
>>> emma[2]
[u'CHAPTER', u'I']
>>> emma[3]
[u'Emma', u'Woodhouse', u',', u'handsome', u',', u'clever', u',', u'and', u'rich', u',', u'with', u'a', u'comfortable', u'home', u'and', u'happy', u'disposition', u',', u'seemed', u'to', u'unite', u'some', u'of', u'the', u'best', u'blessings', u'of', u'existence', u';', u'and', u'had', u'lived', u'nearly', u'twenty', u'-', u'one', u'years', u'in', u'the', u'world', u'with', u'very', u'little', u'to', u'distress', u'or', u'vex', u'her', u'.']
For the nltk.corpus.gutenberg corpus, it loads the PlaintextCorpusReader, see
https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L114
and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py
So it is reading a directory of text files, one of which is 'austen-emma.txt', and it processes the corpus with a default sentence tokenizer and word tokenizer. In the code these are instantiated as the pickled Punkt model tokenizers/punkt/english.pickle and WordPunctTokenizer(), see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L40
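As a rough sketch (an assumption based on the linked source, and it ignores the paragraph-by-paragraph block reading the reader actually performs), the pipeline behind gutenberg.sents() amounts to:
import nltk
from nltk.corpus import gutenberg
from nltk.tokenize import WordPunctTokenizer

raw = gutenberg.raw('austen-emma.txt')  # plain text of the file
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  # Punkt model
word_tokenizer = WordPunctTokenizer()  # splits punctuation into separate tokens

sentences = sent_tokenizer.tokenize(raw)  # sentence tokenization
tokenized = [word_tokenizer.tokenize(s) for s in sentences]  # word tokenization
Because WordPunctTokenizer emits punctuation as separate tokens, joining a tokenized sentence back with " ".join() leaves spaces around the punctuation, which is why the joined strings below look like u'[ Emma by Jane Austen 1816 ]'.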
So to get a list of sentence strings as desired, use:
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> sents_list = [" ".join(sent) for sent in emma]
>>> sents_list[0]
u'[ Emma by Jane Austen 1816 ]'
>>> sents_list[1]
u'VOLUME I'
>>> sents_list[:1]
[u'[ Emma by Jane Austen 1816 ]']
>>> sents_list[:2]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I']
>>> sents_list[:3]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I', u'CHAPTER I']
Upvotes: 1
Reputation: 35540
You appear to be getting correct output, according to the nltk docs:
sents(fileids=None)
Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.
So you just need to turn your list of word strings back into a space-separated sentence:
sentences = [" ".join(list_of_words) for list_of_words in emma]
Upvotes: 3