Reputation: 783
I am trying to create a Python wordcloud using a book from Project Gutenberg.
If I choose Jules Verne's book A Journey to the Centre of the Earth and download the Plain Text UTF-8 file, I get an error from pandas when I use read_csv.
This is the code I am using:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('pg18857.txt',delimiter=' ')
I get the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 14 fields in line 176, saw 15
I have tried several options in pd.read_csv, but I have not been able to parse the text.
Upvotes: 1
Views: 305
Reputation: 11657
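Pandas is designed for structured data, meaning something organised into rows and columns, like a spreadsheet or a matrix. With delimiter=' ', read_csv treats every space as a column separator, so each line of prose yields a different number of fields. The parser gives up when it reaches a line with more fields than it expects, which is what the error reports: line 176 had 15 fields where it expected 14. Loose text is simply too irregular to parse this way.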
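To see why a space delimiter cannot work on prose, here is a minimal sketch that counts the fields a delimiter-based parser would find in each line. The helper name `count_fields` and the sample lines are made up for illustration; they are not taken from the actual file.

```python
def count_fields(line, delimiter=" "):
    """Return how many fields a delimiter-based parser would see in one line."""
    return len(line.split(delimiter))


# Made-up sample lines (an assumption, not copied from pg18857.txt)
sample = [
    "A Journey to the Centre of the Earth",
    "by Jules Verne",
]

# Each line of prose has its own word count, so the "column" count varies
counts = [count_fields(line) for line in sample]
print(counts)
```

Because the counts differ from line to line, there is no consistent set of columns for read_csv to build.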
What you might want to do is split it into a list of lines, then feed that list into Pandas.
Here's a simple example:
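import pandas as pd

# The Project Gutenberg download is the Plain Text UTF-8 file
with open('pg18857.txt', encoding='utf-8') as f:
    content = f.readlines()
# Remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]
# One row per line of the book, in a single column
df = pd.DataFrame(content)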
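Since the end goal is a wordcloud, you may not need a DataFrame at all. As a rough sketch, the text can be reduced to word counts using only the standard library. The function name `word_frequencies`, the sample string, and the tiny stopword set below are assumptions for illustration; the regular expression only keeps unaccented ASCII letters, which may be too strict for some books.

```python
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count lowercase words in `text`, skipping anything in `stopwords`."""
    # ASCII letters only, with an optional inner apostrophe (e.g. "don't")
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(word for word in words if word not in stopwords)


# Illustrative sample; with the real file you would read it first, e.g.
#   with open('pg18857.txt', encoding='utf-8') as f:
#       text = f.read()
text = "The earth, the centre; the Earth!"
freqs = word_frequencies(text, stopwords={"the"})
print(freqs.most_common(2))

# Assuming the wordcloud package is installed, the counts could then be
# passed on with: WordCloud().generate_from_frequencies(freqs)
```

WordCloud also has a generate method that accepts the raw text as a single string, so reading the whole file with f.read() and passing that in is another option.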
Upvotes: 1