user10011655

Text mining UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1671718: character maps to <undefined>

I have written code to create a frequency table, but it breaks at the line text_string = document_text.read().lower(). I even added a try/except to catch the error, but it does not help.

import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    try:
        count = frequency.get(word,0)
        frequency[word] = count + 1
    except UnicodeDecodeError:
        pass

frequency_list = frequency.keys()

for words in frequency_list:
    print (words, frequency[words])

Upvotes: 1

Views: 5378

Answers (2)

AkshayRY

Reputation: 31

To read a file that contains special characters, pass encoding='latin1' or encoding='unicode_escape' to open().
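A minimal sketch of what the 'latin1' suggestion does in practice, assuming a hypothetical sample file that holds UTF-8 text plus one stray 0x81 byte (the file and its contents are made up for illustration). latin1 never raises because it maps every byte to a character, but it can silently garble text that was really UTF-8, so it is worth comparing against utf8 with errors="replace":

```python
import os
import tempfile

# Hypothetical sample: the UTF-8 bytes for "café" followed by a stray 0x81 byte.
sample = b"caf\xc3\xa9 \x81"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

# latin1 assigns a character to every byte value, so this read cannot fail,
# but the two bytes of "é" come out as two separate wrong characters.
with open(path, encoding="latin1") as f:
    as_latin1 = f.read()

# utf8 with errors="replace" keeps the valid text and marks the bad byte with U+FFFD.
with open(path, encoding="utf8", errors="replace") as f:
    as_utf8 = f.read()

os.remove(path)
print(repr(as_latin1))
print(repr(as_utf8))
```

If the file is mostly valid UTF-8 with a few bad bytes, the second form usually preserves more of the text.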

Upvotes: -1

Miquel Vande Velde

Reputation: 184

You are opening your file twice, the second time without specifying the encoding:

file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')

You should open the file as follows:

frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
    text = f.read().lower()

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...

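The second time you opened the file, you did not specify an encoding, so Python fell back to the platform default. On Windows that is typically a 'charmap' code page such as cp1252, in which byte 0x81 is undefined, and that is most likely why it errored. The with statement ensures the file is closed automatically when the block exits, even if an error occurs. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python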

You should probably have a look at error handling as well: your try/except did not enclose the line that actually raised the error, which is the read() call, not the counting loop. See https://www.pythonforbeginners.com/error-handling/
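As a rough sketch of that point, the try block has to wrap the read() call itself. The helper name read_text and the temporary demo file below are assumptions made up for illustration, not part of the original code:

```python
import os
import tempfile


def read_text(path, encoding="utf8"):
    """Return the lowercased file contents, or None if decoding fails."""
    try:
        with open(path, encoding=encoding, mode="r") as f:
            # The decode error is raised here, so the try must enclose this line.
            return f.read().lower()
    except UnicodeDecodeError as err:
        print(f"Could not decode {path} as {encoding}: {err}")
        return None


# Demo with a temporary file containing a lone 0x81 byte, which is not valid UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Some text \x81 more text")
    demo_path = tmp.name

result = read_text(demo_path)
os.remove(demo_path)
```

Returning None is just one possible choice here; retrying with another encoding or re-raising with more context would work as well.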

The following code ignores all decoding errors:

import re
import string  # Do you need this?

with open('EVG_text mining.txt', mode='rb') as f:  # The 'b' in mode makes open() return raw bytes.
    raw = f.read()  # Named raw to avoid shadowing the built-in bytes type.

# 'ignore' drops undecodable bytes; change it to 'replace' to insert U+FFFD (the replacement character) instead.
text = raw.decode('utf-8', 'ignore').lower()  # Lowercase so the [a-z] pattern also matches capitalized words.

match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)

frequencies = {}
for word in match_pattern:  # Your error handling wasn't doing anything here as the error didn't occur here but when reading the file.
    frequencies[word] = frequencies.get(word, 0) + 1

for word, freq in frequencies.items():
    print(word, freq)
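As an optional alternative, the counting loop could be replaced with collections.Counter from the standard library. This is only a sketch on a made-up inline string rather than the actual file:

```python
import re
from collections import Counter

# Made-up sample text standing in for the decoded file contents.
text = "The cat and the hat and the bat".lower()

words = re.findall(r'\b[a-z]{3,15}\b', text)
frequencies = Counter(words)  # Counts each distinct word in one pass.

# most_common() yields (word, count) pairs sorted by descending count.
for word, freq in frequencies.most_common():
    print(word, freq)
```

Counter is a dict subclass, so code that reads from frequencies as a plain dictionary keeps working.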

Upvotes: 2
