Reputation: 5
I'm very new to coding and Python, so I'm really confused by this error. Here's my code from an exercise where I need to find the most used word in a directory containing multiple files:
import pathlib

directory = pathlib.Path('/Users/k/files/Code/exo')
stats = {}

# Count every word in every file of the directory
for path in directory.iterdir():
    file = open(str(path))
    text = file.read().lower()
    file.close()
    punctuation = (";", ".")
    for mark in punctuation:
        text = text.replace(mark, "")
    for word in text.split():
        if word in stats:
            stats[word] = stats[word] + 1
        else:
            stats[word] = 1

# Find the word with the highest count
most_used_word = None
score_max = 0
for word, score in stats.items():
    if score > score_max:
        score_max = score
        most_used_word = word

print("The most used word is:", most_used_word, score_max)
Here's what I get:
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    text = file.read().lower()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 568: invalid start byte
What could cause this error?
Upvotes: 0
Views: 1644
Reputation: 199
Your file probably contains bytes that are not valid UTF-8 (the traceback points to byte 0xff), so Python's default decoding raises the UnicodeDecodeError. You can try reading in 'rb' mode, which skips decoding, like this:
file = open(str(path), 'rb')
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
(From the Python 2 docs; in Python 3, 'rb' mode returns bytes rather than str on every platform.)
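One caveat with 'rb' in Python 3: read() then returns bytes, so a later call such as text.replace(mark, "") with str arguments would raise a TypeError. A minimal sketch of one way around that is to decode the bytes yourself and tell Python what to do with bytes it cannot decode. The helper name count_words, the choice of UTF-8, and errors="replace" are assumptions for illustration, not something taken from the question. The 0xff byte may come from a file saved in another encoding or from a non-text file in the directory, and this sketch does not try to tell which.

```python
import pathlib
import tempfile
from collections import Counter


def count_words(directory, encoding="utf-8"):
    """Count words across all regular files in `directory` (hypothetical helper)."""
    stats = Counter()
    for path in pathlib.Path(directory).iterdir():
        if not path.is_file():
            continue  # skip subdirectories
        # Binary mode returns raw bytes, so reading never raises UnicodeDecodeError.
        with open(path, "rb") as file:
            raw = file.read()
        # Decode explicitly; undecodable bytes become U+FFFD instead of raising.
        text = raw.decode(encoding, errors="replace").lower()
        for mark in (";", "."):
            text = text.replace(mark, "")
        stats.update(text.split())
    return stats


# Demo on a throwaway directory whose only file contains an invalid 0xff byte.
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "a.txt").write_bytes(b"Spam; spam. eggs \xff")
    stats = count_words(tmp)
    most_used_word, score_max = stats.most_common(1)[0]
    print("The most used word is:", most_used_word, score_max)
```

If you know the real encoding of your files, passing it as encoding is better than replacing bytes, because replaced characters are lost. If the directory holds files that are not text at all, skipping them (for example by file extension) may be the cleaner fix.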
Upvotes: 1