Reputation: 122338
I understand that NLTK can split text into sentences and print them out using the following code. But how do I put the sentences into a list instead of outputting them to the screen?
import nltk.data
from nltk.tokenize import sent_tokenize
import os, sys, re, glob
cwd = './extract_en' #os.getcwd()
for infile in glob.glob(os.path.join(cwd, 'fileX.txt')):
    (PATH, FILENAME) = os.path.split(infile)
    read = open(infile)
    for line in read:
        sent_tokenize(line)
The sent_tokenize(line) call prints it out. How do I put it into a list?
Upvotes: 0
Views: 5835
Reputation: 151187
Here's a simplified version that I used to test the code:
import nltk.data
from nltk.tokenize import sent_tokenize
import sys
infile = open(sys.argv[1])
slist = []
for line in infile:
    slist.append(sent_tokenize(line))
print slist
infile.close()
When called like so, it prints the following:
me@mine:~/src/ $ python nltkplay.py nltkplay.py
[['import nltk.data\n'], ['from nltk.tokenize import sent_tokenize\n'], ['import sys\n'], ['infile = open(sys.argv[1])\n'], ['slist = []\n'], ['for line in infile:\n'], [' slist.append(sent_tokenize(line))\n'], ['print slist\n'], ['\n']]
When doing something like this, a list comprehension is more concise and IMO more pleasant to read:
slist = [sent_tokenize(line) for line in infile]
To clarify, the above returns a list of lists of sentences, one list of sentences for each line. If you want a flat list of sentences, do this instead, as eyquem suggests:
slist = sent_tokenize(infile.read())
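To make the difference between the two shapes concrete, here is a minimal sketch. It assumes a hypothetical stand-in, split_sentences, built on a simple regular expression so that the example runs without NLTK's tokenizer data; in real code you would call sent_tokenize in its place. The sample lines are made up for illustration.

```python
import re
from itertools import chain


def split_sentences(text):
    """Hypothetical stand-in for sent_tokenize (an assumption, not NLTK's
    algorithm): split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


# Made-up input, standing in for the lines of a file.
lines = ["Hello there. How are you?\n", "Fine, thanks.\n"]

# One list of sentences per line: a list of lists.
nested = [split_sentences(line) for line in lines]

# Flatten the per-line lists into a single list of sentences.
flat = list(chain.from_iterable(nested))

# Tokenize the whole text at once: also a single flat list.
whole = split_sentences("".join(lines))
```

Flattening the per-line results and tokenizing the whole text give the same list here, but they can differ when a sentence runs across a line break, which is one reason to prefer reading the whole file first.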
Upvotes: 2
Reputation: 27585
You should not use a name like read for an object in your program: it is not a reserved keyword, but it is easily confused with the file method read().
If you want to append to a list, you must first create one:
reclist = []
for line in f:
    reclist.append(line)
or with a list comprehension:
reclist = [ line for line in f ]
or using Python's built-in file methods:
reclist = f.readlines()
Or perhaps I didn't understand what you want.
EDIT:
Well, considering Jochen Ritzel's remark, you want:
f = open(infile)
reclist = sent_tokenize(f.read())
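Putting this together with the glob loop from the question, a rough sketch could look like the following. The helper name sentences_from_files and its parameters are my own invention for illustration, and the tokenizer is passed in as an argument; the demo at the bottom uses str.splitlines as a stand-in on made-up files so that it runs without NLTK's tokenizer data.

```python
import glob
import os
import tempfile


def sentences_from_files(directory, pattern, tokenize):
    """Return one flat list of sentences from every file matching pattern.

    Hypothetical helper: `tokenize` is any callable that maps a string to a
    list of sentences, for example nltk.tokenize.sent_tokenize.
    """
    sentences = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as f:  # the with block closes the file for us
            sentences.extend(tokenize(f.read()))
    return sentences


# Demo on made-up files, one sentence per line, with str.splitlines
# standing in for the real tokenizer.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.txt"), "w") as f:
        f.write("First sentence.\nSecond sentence.\n")
    with open(os.path.join(tmp, "b.txt"), "w") as f:
        f.write("Third sentence.\n")
    demo = sentences_from_files(tmp, "*.txt", str.splitlines)
```

With NLTK installed, the call for the question's setup would be something like sentences_from_files('./extract_en', 'fileX.txt', sent_tokenize).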
Upvotes: 1