Reputation: 6132
I'm learning a lot about Natural Language Processing with nltk and can do a lot of things, but I can't find a way to read the texts from the package. I have tried things like this:
from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()
But it doesn't seem to work, because it has no fileid. No one seems to have asked this question before, so I assume the answer should be easy. Do you know how to read those texts or how to convert them into a string? Thanks in advance.
Upvotes: 3
Views: 5498
Reputation: 122112
Let's dig into the code =)
Firstly, the nltk.book code resides on https://github.com/nltk/nltk/blob/develop/nltk/book.py
If we look carefully, the texts are loaded as nltk.Text objects, e.g. for text6 from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36 :
text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")
The Text object comes from https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286 ; you can read more about how to use it at http://www.nltk.org/book/ch02.html
webtext is a corpus from nltk.corpus, so to get to the raw text of nltk.book.text6, you could load webtext directly, e.g.
>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')
The fileids method comes only with a PlaintextCorpusReader object, not with the Text object (a processed object):
>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
...
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt
Upvotes: 8
Reputation: 2784
Looks like they already break it up into tokens for you.
from nltk.book import text6
text6.tokens
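If you want a single string rather than a list, the tokens can be joined back together. Below is a minimal sketch, not part of NLTK: tokens_to_string is a hypothetical helper, and the sample list only stands in for text6.tokens so the snippet runs without the corpus data. It assumes the tokens are plain strings and that punctuation should attach to the preceding word.

```python
def tokens_to_string(tokens):
    """Join word tokens into one string, gluing trailing punctuation to the previous word."""
    pieces = []
    for token in tokens:
        # A token made only of closing punctuation gets no space before it.
        if pieces and token and all(ch in ".,;:!?" for ch in token):
            pieces[-1] += token
        else:
            pieces.append(token)
    return " ".join(pieces)


# Stand-in for text6.tokens (hypothetical sample, not the real corpus content).
sample = ["Whoa", "there", "!", "Halt", ",", "who", "goes", "there", "?"]
print(tokens_to_string(sample))
```

With the corpus installed, tokens_to_string(text6.tokens) should give an approximation of the text; the original spacing and line breaks are lost during tokenization, so webtext.raw('grail.txt') is the better choice when the exact string matters.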
Upvotes: 2