How to import and read a WSJ corpus in Python

Question

I have code that builds an n-gram model to test next-word prediction on a given corpus. How can I replace that corpus with the WSJ corpus as the training data? Part of the program is given below.

# import libraries needed, read the dataset
import nltk, re, pprint, string
from nltk import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
string.punctuation = string.punctuation +'“'+'”'+'-'+'’'+'‘'+'—'
string.punctuation = string.punctuation.replace('.', '')
file = open('./corpus.txt', encoding = 'utf8').read()

#preprocess data
file_nl_removed = ""
for line in file:
  line_nl_removed = line.replace("\n", " ")
  file_nl_removed += line_nl_removed
file_p = "".join([char for char in file_nl_removed if char not in string.punctuation]) 

#nltk.download('punkt')
sents = nltk.sent_tokenize(file_p)
print("The number of sentences is", len(sents))

Meti · Accepted Answer

If you want to use the WSJ corpus that ships with the nltk package (the treebank resource, which is a sample of the Penn Treebank WSJ text), it becomes available once you download it:

import nltk
nltk.download('treebank')
from nltk.corpus import treebank
print(treebank.fileids()[:10])
print(treebank.words('wsj_0003.mrg')[:10])

output:

['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', 'wsj_0005.mrg', 'wsj_0006.mrg', 'wsj_0007.mrg', 'wsj_0008.mrg', 'wsj_0009.mrg', 'wsj_0010.mrg']
['A', 'form', 'of', 'asbestos', 'once', 'used', '*', '*', 'to', 'make']
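
To feed this corpus into an n-gram model like the one in the question, one option is to skip the raw-text preprocessing and use the sentences that treebank already provides in tokenized form. The sketch below is only an illustration: the function names load_wsj_sentences, build_ngram_counts and predict_next are hypothetical helpers, not part of nltk, and the '<s>' / '</s>' padding and lowercasing are assumptions about how the model should be set up. It drops tokens tagged -NONE-, which are trace markers such as the '*' entries in the output above. Punctuation tokens are kept, so filter them as well if your model should ignore them.

```python
from collections import Counter, defaultdict


def load_wsj_sentences():
    """Return the NLTK treebank (WSJ sample) as lists of lowercased word tokens."""
    import nltk
    nltk.download('treebank', quiet=True)  # fetches the data if it is not installed yet
    from nltk.corpus import treebank

    sentences = []
    for tagged_sent in treebank.tagged_sents():
        # Tokens tagged -NONE- are trace markers (e.g. '*'), not real words.
        words = [word.lower() for word, tag in tagged_sent if tag != '-NONE-']
        if words:
            sentences.append(words)
    return sentences


def build_ngram_counts(sentences, n=2):
    """Map each (n-1)-token context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for words in sentences:
        # Assumed padding symbols so sentence starts and ends are modelled too.
        padded = ['<s>'] * (n - 1) + list(words) + ['</s>']
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            counts[context][padded[i + n - 1]] += 1
    return counts


def predict_next(counts, context, k=3):
    """Return up to k of the most frequent continuations of the given context."""
    followers = counts.get(tuple(context), Counter())
    return [word for word, _ in followers.most_common(k)]
```

With the corpus downloaded, counts = build_ngram_counts(load_wsj_sentences(), n=3) builds trigram counts, and predict_next(counts, ['of', 'the']) returns the most frequent words seen after "of the" in the sample.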
