Edward Lim

Reputation: 823

Unicode Decode Error when trying to read data from a .txt file in Python

I am very new to Python scripting and I am stuck on what should be a simple task: reading data from a .txt file and parsing it.

Steps I have taken

  1. Downloaded the PDF file from my school's website. It contains a list of courses: http://info.sjsu.edu/cgi-bin/pdfserv?ftok=soc-fall-courses
  2. Converted the PDF to a .txt file simply by saving it as a .txt file
  3. Googled the error and found out that it is some sort of encoding issue
  4. Ran the terminal command file -I [filename], which returned sjsuclassdata.txt: text/plain; charset=unknown-8bit
  5. Tried several methods found online to convert the file to UTF-8, but to no avail

Error Message that I got

Traceback (most recent call last):
  File "/Users/edward/MyPythonScripts/sjsuClassExtractor.py", line 25, in <module>
    regexMatches = lectureRegex.findall(file.read())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 9: invalid continuation byte

As you can see, I am really lost as to what I'm supposed to do from here. I have verified that the script works if I read a different file that contains similar data.

Upvotes: 1

Views: 723

Answers (1)

Selcuk

Reputation: 59184

Assuming that the original text file is ANSI-encoded (the default with Acrobat Reader's 'Save As Text' option), which the command below treats as ISO-8859-1, this command will convert it to UTF-8:

iconv -f "iso-8859-1" -t "utf-8" sjsuclassdata.txt > sjsuclassdata-utf8.txt
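If you would rather stay in Python, the same conversion can be sketched as below. This is only a minimal example under the same assumption as the command above (that the file really is ISO-8859-1); the helper name convert_to_utf8 is made up for illustration. If the file turns out to be in another 8-bit encoding, pass a different codec name such as "cp1252" or "mac_roman".

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Re-encode a text file as UTF-8.

    source_encoding is an assumption about the input file, not something
    detected from it. ISO-8859-1 maps every byte to a character, so decoding
    never fails, but the text will look wrong if the guess is incorrect.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return text


# Example (file names taken from the question):
# convert_to_utf8("sjsuclassdata.txt", "sjsuclassdata-utf8.txt")
```

Alternatively, you can skip the conversion and pass the encoding when opening the file in your script, e.g. open("sjsuclassdata.txt", encoding="iso-8859-1"), so that file.read() no longer tries to decode the data as UTF-8.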

Upvotes: 2
