Pandas read_csv error tokenizing text from Gutenberg project

Question

I am trying to create a Python wordcloud using a book from Project Gutenberg.

If I choose Jules Verne's book A Journey to the Centre of the Earth and download the Plain Text UTF-8 file, I get an error from pandas when I use read_csv.

This is the code I am using:

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('pg18857.txt', delimiter=' ')

I get the following error message:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 14 fields in line 176, saw 15

I have tried several options in pd.read_csv, but I have not been able to parse the text.

Josh Friedlander · Accepted Answer

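Pandas is designed for structured data: something organised into rows and columns, like a spreadsheet or a matrix. read_csv will give a text file a try, but with delimiter=' ' every space starts a new field, so each line of prose yields a different number of columns. That is what the error is reporting: line 176 has 15 space-separated fields where the parser expected 14. Loose text is far too disorganised for Pandas to parse this way.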
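
To see why the tokenizer complains, it helps to count what a single-space delimiter produces per line. This is a minimal sketch in plain Python, using made-up sample lines rather than the book itself:

```python
# Hypothetical sample lines standing in for the book's text.
lines = [
    "A short line",
    "A somewhat longer line of plain prose",
    "End",
]

# Splitting on a single space mimics what delimiter=' ' asks the parser to do:
# every word becomes its own field.
field_counts = [len(line.split(' ')) for line in lines]
print(field_counts)  # [3, 7, 1]
```

No two lines agree on a field count, so there is no consistent set of columns for a table.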

What you might want to do is read the file into a list of lines, then feed that list into Pandas.

Here's a simple example:

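import pandas as pd

# The file was downloaded as Plain Text UTF-8, so state the encoding explicitly
with open('pg18857.txt', encoding='utf-8') as f:
    content = f.readlines()
# Remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
# One row per line of the file, in a single column
df = pd.DataFrame(content)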
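
Since the end goal is a word cloud, you may only need word counts rather than a DataFrame. Here is a rough sketch using only the standard library. The word_frequencies helper and the sample lines are my own illustration, and the regex is just one simple assumption about what counts as a word:

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Count lowercase words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Assumed rule: a word is a run of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Hypothetical sample; in practice pass the `content` list read from the file.
sample = ["The centre of the Earth", "the earth is round"]
freqs = word_frequencies(sample)
print(freqs.most_common(2))  # [('the', 3), ('earth', 2)]
```

A mapping like this can be passed to WordCloud().generate_from_frequencies(freqs). As far as I know, the stopwords setting is not applied on that path, so you would filter out STOPWORDS yourself first. If you still want a table, pd.DataFrame(freqs.most_common(), columns=['word', 'count']) gives one with a consistent shape.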
