I have code that builds an n-gram model for next-word prediction from a provided corpus. How can I replace the given corpus so that the WSJ corpus is read as the training corpus? Part of the program is given below.
# import libraries needed, read the dataset
import nltk, re, pprint, string
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
string.punctuation = string.punctuation +'“'+'”'+'-'+'’'+'‘'+'—'
string.punctuation = string.punctuation.replace('.', '')
file = open('./corpus.txt', encoding = 'utf8').read()
#preprocess data
file_nl_removed = ""
for line in file:
    line_nl_removed = line.replace("\n", " ")
    file_nl_removed += line_nl_removed
file_p = "".join([char for char in file_nl_removed if char not in string.punctuation])
#nltk.download('punkt')
sents = nltk.sent_tokenize(file_p)
print("The number of sentences is", len(sents))
If you are going to use the WSJ corpus from the nltk package, it is available after you download it:
import nltk
nltk.download('treebank')
from nltk.corpus import treebank
print(treebank.fileids()[:10])
print(treebank.words('wsj_0003.mrg')[:10])
output:
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg', 'wsj_0009.mrg', 'wsj_0010.mrg']
['A', 'form', 'of', 'asbestos', 'once', 'used', '*', '*', 'to', 'make']
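To plug this into the pipeline from the question, one possible approach (a sketch, not part of the original code) is to build the sents list directly from treebank.sents(), which already yields one tokenized sentence at a time, so the file reading and sent_tokenize steps can be skipped:
import nltk
nltk.download('treebank')
from nltk.corpus import treebank

# treebank.sents() gives each WSJ sentence as a list of tokens;
# join the tokens back into plain-text sentences for the existing n-gram code.
sents = [" ".join(tokens) for tokens in treebank.sents()]
print("The number of sentences is", len(sents))
Note that the parsed .mrg files also contain trace tokens such as '*' (visible in the output above), which you may want to filter out before training the n-gram model.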