krishna

Reputation: 609

Unable to read a file in Python

Hi, I have a tar file containing files named 0_data, 0_index, etc. I am trying to open the tar file and read the contents of these files. So far I can extract all the files, but I cannot read the contents of the individual files. I know they are not plain text files, but if I cannot see their contents, how can I parse them? They are a bunch of webpages.

The error I get when I try to open a file is:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>

Here's my code:

import os
import tarfile

def is_tarfile(file):
    return tarfile.is_tarfile(file)

def extract_tarfile(file):
    if is_tarfile(file):
        my_tarfile=tarfile.open(file)
        my_tarfile.extractall("c:/untar")
        read_files_nz2("c:/untar/nz2_merged")
        return 1
    return 0

def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
             path = os.path.join(subdir,i)
             print(path)
             content=open(path,'r')
             print (content.read())

extract_tarfile("c:/nz2.tar")

print(path) will output the path of the file, but print(content.read()) gives an error:

return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 87: character maps to <undefined>

I hope somebody can help me with reading data from the files.

Upvotes: 0

Views: 3508

Answers (4)

Bassem Shahin

Reputation: 706

I'm not sure what the problem is, but the same thing happened to me and I solved it by using this encoding:

with open(path, 'r', encoding="ISO-8859-1") as f:
    content = f.read()
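
A small sketch of why this works, assuming the default codec on the asker's machine was cp1252 (the 'charmap' error suggests it, but the question does not say so). Byte 0x81 has no mapping in cp1252, while ISO-8859-1 maps every byte value to a character, so decoding never fails. The result may still be garbage if the data is not really Latin-1.

```python
raw = bytes([0x48, 0x69, 0x81])  # "Hi" followed by the problematic byte 0x81

# cp1252 leaves 0x81 undefined, so decoding raises UnicodeDecodeError.
try:
    raw.decode("cp1252")
    cp1252_ok = True
except UnicodeDecodeError:
    cp1252_ok = False

# ISO-8859-1 maps all 256 byte values, so this always succeeds.
latin1_text = raw.decode("iso-8859-1")

print(cp1252_ok)
print(ascii(latin1_text))
```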

Another option is to rewrite your file as UTF-8. Note that this code assumes the source file is UTF-16:

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

Upvotes: 0

rici

Reputation: 241671

You need to do one of two things:

  • specify an encoding when you open the file:

    # This is probably not the right encoding.
    content = open(path, 'r', encoding='utf-8')
    

    For that, you need to know what the encoding of the file is.

  • open the file in binary mode:

    content = open(path, 'rb')
    

    This will cause read to return a bytes object instead of a string, but it will avoid any attempt to decode or validate the individual bytes.
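
The two options can also be combined: read the raw bytes, then decode them with an explicit codec and an error policy. This is only a minimal sketch. The helper name `read_text_lenient` and the default of UTF-8 are assumptions for illustration, not something taken from the question.

```python
import os
import tempfile


def read_text_lenient(path, encoding="utf-8"):
    """Read a file as bytes, then decode, replacing undecodable bytes with U+FFFD."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding, errors="replace")


# Demo on a temporary file containing a byte that is invalid in UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "0_data")
    with open(demo_path, "wb") as f:
        f.write(b"<html>\x81</html>")
    text = read_text_lenient(demo_path)

print(ascii(text))
```

Replacing bad bytes loses information, so this is best suited to inspecting the data while you work out the real encoding.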

Upvotes: 1

Esdes

Reputation: 1002

You need a full file path to access it, not just a name. Your second function should look like:

def read_files_nz2(file):
    for subdir, dirs, files in os.walk(file):
        for i in files:
            path = os.path.join(subdir, i)  # Getting full path to the file
            content = open(path, 'r')
            print(content.read())

Upvotes: 1

Alejandro

Reputation: 1216

I'm not 100% sure this is your problem, but it is at least bad practice and a possible source of your problem.

You are not closing any files that you open. For example you have:

my_tarfile=tarfile.open(file)

But somewhere after that and before you open another file you should have:

my_tarfile.close()

Here's a quote from diveintopython:

Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.

My thinking is that because you never close my_tarfile, the system can't properly read the files extracted from it. Even if that isn't the problem, it is good practice to close your files as soon as you can.
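
A `with` block closes the archive automatically. The sketch below also reads each member's bytes straight from the archive with `extractfile`, so nothing has to be extracted to disk first. The function name `read_tar_members` is hypothetical, and the demo builds a tiny in-memory archive rather than using the asker's nz2.tar.

```python
import io
import tarfile


def read_tar_members(fileobj):
    """Return {member name: raw bytes} for every regular file in a tar archive."""
    contents = {}
    with tarfile.open(fileobj=fileobj) as tar:  # closed on leaving the block
        for member in tar.getmembers():
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents


# Build a small archive in memory to exercise the function.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"page \x81 bytes"
    info = tarfile.TarInfo(name="nz2_merged/0_data")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

result = read_tar_members(buf)
print(sorted(result))
```

For an archive on disk, `tarfile.open("c:/nz2.tar")` can be used in the same `with` form.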

Upvotes: 1
