Reputation: 609
Hi, I have a tar file containing files named 0_data, 0_index, etc. What I am trying to do is open the tar file and read through the contents of these files. So far I have been able to extract all the files, but I cannot read the contents of the individual files. I know that they are not plain text files, but if I cannot see their contents, how can I parse them? The files are a bunch of web pages.
The error I get when I try to open a file is:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>
Here's my code:
import os
import tarfile

def is_tarfile(file):
    return tarfile.is_tarfile(file)

def extract_tarfile(file):
    if is_tarfile(file):
        my_tarfile=tarfile.open(file)
        my_tarfile.extractall("c:/untar")
        read_files_nz2("c:/untar/nz2_merged")
        return 1
    return 0

def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
            path = os.path.join(subdir,i)
            print(path)
            content=open(path,'r')
            print (content.read())

extract_tarfile("c:/nz2.tar")
print(path) outputs the path of each file, but print(content.read()) raises an error:
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>
I hope somebody can help me with reading the data from these files.
Upvotes: 0
Views: 3508
Reputation: 706
I'm not sure what the cause is, but I ran into the same error and solved it by specifying this encoding:
with open(path, 'r', encoding="ISO-8859-1") as f:
    content = f.read()
Another option is to rewrite the file as UTF-8. Note that this code assumes the source file is encoded as UTF-16:
with open(ff_name, 'rb') as source_file:
    with open(target_file_name, 'w+b') as dest_file:
        contents = source_file.read()
        dest_file.write(contents.decode('utf-16').encode('utf-8'))
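To illustrate why ISO-8859-1 makes the error go away: it maps every one of the 256 byte values to a character, so decoding never fails. The sketch below assumes the default 'charmap' codec in the traceback is cp1252 (typical on Windows, but not confirmed by the question), and the sample bytes are made up:

```python
# Hypothetical sample bytes containing 0x81, the byte named in the traceback.
sample = b"abc\x81def"

# cp1252 (assumed to be the default 'charmap' codec) has no character for 0x81.
try:
    sample.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# ISO-8859-1 defines all 256 byte values, so this never raises.
# The result is only meaningful if the data really is Latin-1 text.
latin1_text = sample.decode("iso-8859-1")

print(cp1252_ok)
print(ascii(latin1_text))
```

Since the files in the question are not plain text, a successful Latin-1 decode only means no exception is raised; it does not mean the resulting characters are correct.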
Upvotes: 0
Reputation: 241671
You need to do one of two things:
1. Specify an encoding when you open the file:
   # This is probably not the right encoding.
   content = open(path, 'r', encoding='utf-8')
   For that, you need to know what the encoding of the file is.
2. Open the file in binary mode:
   content = open(path, 'rb')
   This will cause read to return a bytes object instead of a string, but it will avoid any attempt to decode or validate the individual bytes.
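The two options can also be combined: read the raw bytes first, then decode them separately once you have a guess at the encoding. This is only a sketch; the helper name read_bytes, the temporary demo file, and the choice of UTF-8 with errors="replace" are assumptions for illustration, not something the question confirms:

```python
import os
import tempfile

def read_bytes(path):
    """Return the raw contents of a file without any decoding (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read()

# A temporary file stands in for one of the extracted files; the bytes are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "0_data")
    with open(demo_path, "wb") as f:
        f.write(b"<html>\x81caf\xc3\xa9</html>")
    raw = read_bytes(demo_path)

# Decode as a separate step. errors="replace" substitutes U+FFFD for bytes
# that are not valid in the assumed encoding instead of raising.
text = raw.decode("utf-8", errors="replace")

print(type(raw).__name__, len(raw))
print(ascii(text))
```

Keeping the bytes around means a wrong encoding guess costs nothing: you can decode again with a different codec, or hand the bytes to an HTML parser that detects the encoding itself.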
Upvotes: 1
Reputation: 1002
You need the full file path to open a file, not just its name. Your second function should look like:
def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
            path = os.path.join(subdir, i)  # Getting full path to the file
            content=open(path,'r')
            print (content.read())
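As a small self-contained check of the point about paths, the sketch below builds a throwaway directory and collects the joined paths. The function name list_file_paths and the demo directory layout are made up for illustration:

```python
import os
import tempfile

def list_file_paths(root):
    """Return the full path of every file under root (hypothetical helper)."""
    paths = []
    for subdir, dirs, files in os.walk(root):
        for name in files:
            # os.walk yields bare file names; join with subdir to get a usable path.
            paths.append(os.path.join(subdir, name))
    return sorted(paths)

# Throwaway directory tree that mimics the layout from the question.
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "nz2_merged")
    os.makedirs(nested)
    for name in ("0_data", "0_index"):
        with open(os.path.join(nested, name), "wb") as f:
            f.write(b"\x81")
    found = list_file_paths(tmp)
    relative = [os.path.relpath(p, tmp) for p in found]

print(relative)
```

Opening a bare name from files would look in the current working directory, which is why the join with subdir matters.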
Upvotes: 1
Reputation: 1216
I'm not 100% sure this is your problem, but it is at least bad practice and a possible source of your problem.
You are not closing any files that you open. For example you have:
my_tarfile=tarfile.open(file)
But somewhere after that and before you open another file you should have:
my_tarfile.close()
Here's a quote from diveintopython:
Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.
My thinking is that because you never close my_tarfile, the system can't properly read the files extracted from it. Even if that isn't the problem, it is good practice to close your files as soon as you can.
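One way to make sure the archive is always closed is a with block, which calls close() for you even if an exception is raised. The sketch below goes one step further than this answer and reads each member as bytes straight from the archive, without extracting to disk. The function name read_tar_members and the demo archive are assumptions for illustration:

```python
import io
import os
import tarfile
import tempfile

def read_tar_members(tar_path):
    """Return {member name: raw bytes} for every regular file in the archive."""
    contents = {}
    # The with block closes the archive when the block ends.
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-regular entries
            with tar.extractfile(member) as f:
                contents[member.name] = f.read()
    return contents

# Build a small demo archive; the member names and payloads are made up.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "demo.tar")
    with tarfile.open(tar_path, "w") as tar:
        for name, payload in (("nz2_merged/0_data", b"\x81page"),
                              ("nz2_merged/0_index", b"index")):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    result = read_tar_members(tar_path)

print(sorted(result))
```

The same pattern applies to ordinary files: with open(path, 'rb') as f: closes the file as soon as the block ends.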
Upvotes: 1