Juan C

Reputation: 6132

How to read nltk.text.Text files from nltk.book in Python?

I'm learning a lot about Natural Language Processing with nltk and can already do many things, but I can't find a way to read the texts that ship with the package. I have tried things like this:

from nltk.book import *
text6 #Brings the title of the text
open(text6).read()
#or
nltk.book.text6.read()

But neither works, apparently because the text has no fileid. No one seems to have asked this question before, so I assume the answer is easy. Do you know how to read those texts, or how to convert them into a string? Thanks in advance.

Upvotes: 3

Views: 5498

Answers (3)

Johnny

Reputation: 869

# Generate the sorted list of unique tokens in text6
from nltk.book import text6

print(sorted(set(text6)))

Upvotes: 0

alvas

Reputation: 122112

Let's dig into the code =)

First, the nltk.book code lives at https://github.com/nltk/nltk/blob/develop/nltk/book.py

If we look carefully, the texts are loaded as nltk.Text objects, e.g. for text6 from https://github.com/nltk/nltk/blob/develop/nltk/book.py#L36 :

text6 = Text(webtext.words('grail.txt'), name="Monty Python and the Holy Grail")

The Text class is defined at https://github.com/nltk/nltk/blob/develop/nltk/text.py#L286 , and you can read more about how to use it at http://www.nltk.org/book/ch02.html

webtext is a corpus from nltk.corpus, so to get the raw text behind nltk.book.text6 you can load webtext directly, e.g.

>>> from nltk.corpus import webtext
>>> webtext.raw('grail.txt')

The fileids are available only on a corpus reader such as the PlaintextCorpusReader object, not on the Text object (which is a processed object):

>>> type(webtext)
<class 'nltk.corpus.reader.plaintext.PlaintextCorpusReader'>
>>> for filename in webtext.fileids():
...     print(filename)
... 
firefox.txt
grail.txt
overheard.txt
pirates.txt
singles.txt
wine.txt

Upvotes: 8

Jon

Reputation: 2784

Looks like they already break it up into tokens for you.

from nltk.book import text6

text6.tokens
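
Since the question also asks how to get a single string, here is a minimal sketch of joining those tokens back together. The helper name text_to_string and the in-memory sample are hypothetical, used only so the snippet runs without downloading any corpus data; with the book loaded you could pass text6 instead. Note that joining on spaces only approximates the original text, since the original whitespace and line breaks are lost during tokenization.

```python
from nltk.text import Text


def text_to_string(text: Text) -> str:
    """Join the tokens of an nltk Text into one space-separated string."""
    # Text.tokens holds the token strings the Text was built from.
    return " ".join(text.tokens)


# Hypothetical in-memory sample (not part of nltk.book), so no corpus
# download is needed to try this out.
sample = Text(["Hello", ",", "world", "!"], name="sample")
print(text_to_string(sample))
```

If you need the text exactly as it appears in the source file, reading it from the corpus reader is probably the better route than re-joining tokens.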

Upvotes: 2
