Handling data compression with non-ASCII values while reading and writing a file

Question

I am trying to learn lossless compression algorithms using Python 3. So far I have implemented Huffman coding, the Burrows-Wheeler transform, and move-to-front, each of which can handle up to 256 unique characters based on their ASCII values. I read a UTF-8 text file, join its characters into a single string, and then transform that string to compress it. All the algorithms work, but the problem lies in reading a file with non-ASCII characters: if I read the file without encoding it, the ordinal value of some special characters goes up to 8221 and the move-to-front algorithm gives this error:

ValueError: 8221 is not in list

To read the file I tried:

with open('test.txt','r',encoding='utf-8') as f:
    data = f.readlines()
charData = ''.join(str(x.encode('utf-8'))[2:-1] for x in data)
huffmanEncode(mtfEncoding(bwt_suffixArray(charData)))

This encodes each line and slices the b'' wrapper off its bytes representation,

which converts this-> 'you’ll have to check'

to this-> 'you\xe2\x80\x99ll have to check'

I then pass this string in, compress it, and decompress it. Decompression works and I get back the same string of escaped byte values. My question is how to get the original content of the file back. I tried:

print(bytes(decompressedStr).decode('utf-8'))
#Gives:
>>>TypeError: string argument without an encoding

and:

print(codecs.encode(str,decompressedStr).decode('utf-8'))
#Gives same exact string back:
>>>you\xe2\x80\x99ll have to check

Is there a more efficient way to do this? If not, how do I convert a string of escaped byte values back to the original UTF-8 text?

Mark Tolonen · Accepted Answer

Compression algorithms work on bytes, which is what an encoded file contains. Open your original file in binary mode:

with open('test.txt','rb') as f:
    data = f.read()

Don't decode it to Unicode characters, whose ordinal values can be much larger than a byte. Compress the bytes, decompress the bytes, then decode the result to Unicode.
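As a small illustration of the difference, here is a sketch using the sample string from the question. It compares the largest code point in the decoded text with the largest value in its UTF-8 bytes; the variable names are only for this example.

```python
text = 'you’ll have to check'

# Decoded text: code points are not limited to 0..255.
print(max(ord(ch) for ch in text))   # 8217, the code point of ’ (U+2019)

# Encoded bytes: every value fits in 0..255.
data = text.encode('utf-8')
print(max(data))                     # 226 (0xE2)
print(data)                          # b'you\xe2\x80\x99ll have to check'

# Decoding the bytes restores the original text.
assert data.decode('utf-8') == text
```

Iterating over a bytes object yields integers in the range 0 to 255, so a 256-entry alphabet always contains every symbol.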

Full example:

#!python3
#coding:utf8
import lzma

text = '''Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!'''

# Create a file containing non-ASCII characters:
with open('test.txt','w',encoding='utf8') as f:
    f.write(text)

# Read the raw bytes data.
with open('test.txt','rb') as f:
    data = f.read()

# Note: The file write/read can be skipped by encoding the original Unicode text
#       to bytes manually.
#
# data = text.encode('utf8')

# Using a built-in Python compression/decompression algorithm.
compressed_data = lzma.compress(data)
decompressed_data = lzma.decompress(compressed_data)

print('original length =',len(data))
print('compressed length =',len(compressed_data))
print('decompressed length =',len(decompressed_data))
assert data == decompressed_data

# Now decode the byte data back to Unicode.
print(decompressed_data.decode('utf8'))

Output:

original length = 455
compressed length = 372
decompressed length = 455
Hola! Yo empecé aprendo Español hace dos mes en la escuela. Yo voy la universidad. Yo tratar estudioso Español tres hora todos los días para que yo saco mejor rápido. ¿Cosa algún yo debo hacer además construir mí vocabulario? Muchas veces yo estudioso la palabras solo para que yo construir mí voabulario rápido. Yo quiero empiezo leo el periódico Español la próxima semana. Por favor correcto algún la equivocaciónes yo hisciste. Gracias!
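
The same approach should carry over to hand-written algorithms such as the ones in the question. Below is a minimal sketch of a move-to-front transform that works directly on bytes. The names mtf_encode and mtf_decode are hypothetical and are not the asker's mtfEncoding function; the sketch only assumes the usual move-to-front definition over a fixed 256-symbol alphabet.

```python
def mtf_encode(data: bytes) -> bytes:
    """Move-to-front over the fixed alphabet 0..255."""
    alphabet = list(range(256))
    out = bytearray()
    for byte in data:
        index = alphabet.index(byte)   # always found: byte is in 0..255
        out.append(index)
        alphabet.insert(0, alphabet.pop(index))
    return bytes(out)


def mtf_decode(indices: bytes) -> bytes:
    """Invert mtf_encode by replaying the same alphabet updates."""
    alphabet = list(range(256))
    out = bytearray()
    for index in indices:
        byte = alphabet.pop(index)
        out.append(byte)
        alphabet.insert(0, byte)
    return bytes(out)


# Encode the text to bytes first, transform, then decode only at the end.
original = 'you’ll have to check'.encode('utf-8')
encoded = mtf_encode(original)
restored = mtf_decode(encoded)

assert restored == original
print(restored.decode('utf-8'))   # you’ll have to check
```

Because every input symbol is a byte, alphabet.index can never raise the "is not in list" ValueError, and the text is recovered with a single decode('utf-8') after decompression.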
