Reputation: 185
I want to read a file with data, coded in hex format:
01ff0aa121221aff110120...etc
The files contain more than 100,000 such bytes, some more than 1,000,000 (they come from DNA sequencing).
I tried the following code (and similar variants):
filele=1234563
f=open('data.geno','r')
c=[]
for i in range(filele):
    a=f.read(1)
    b=a.encode("hex")
    c.append(b)
f.close()
This gives each byte separately ("aa", "01", "f1", etc.), which is perfect for me!
It works fine up to (in this case) byte number 905, which happens to be "1a". I also tried the ord() function, which stopped at the same byte.
Is there a simple solution?
Upvotes: 11
Views: 70747
Reputation: 2627
If the file is encoded in hex format, shouldn't each byte be represented by 2 characters? So
c=[]
with open('data.geno','rb') as f:
    b = f.read(2)
    while b:
        c.append(b.decode('hex'))
        b = f.read(2)
or you can even do
with open('data.geno','rb') as f:
    c = list(f.read().decode('hex'))
for example (in python 2.7.18), this works
>>> list(b'404040'.decode('hex'))
['@', '@', '@']
This won't work in Python 3, where bytes objects have no 'hex' decode codec. In Python 3 you would use the codecs module:
import codecs
with open('data.geno','rb') as f:
    c = list(map(chr, codecs.decode(f.read(), 'hex')))
or (depending on whether you are looking for them as number or as characters)
import codecs
with open('data.geno','rb') as f:
    c = list(codecs.decode(f.read(), 'hex'))
because in Python 3,
>>> import codecs
>>> codecs.decode(b'404040', 'hex')
b'@@@'
>>> list(codecs.decode(b'404040', 'hex'))
[64, 64, 64]
>>> list(map(chr, codecs.decode(b'404040', 'hex')))
['@', '@', '@']
or even ''.join(map(chr, codecs.decode(f.read(), 'hex')))
if you want a string instead of a list.
>>> ''.join(map(chr, codecs.decode(b'404040', 'hex')))
'@@@'
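On Python 3, when the file really holds hex text (two ASCII characters per byte), the built-in bytes.fromhex is another option. A minimal sketch, assuming the text contains only hex digits plus surrounding whitespace; the helper name hex_text_to_ints is made up for illustration:

```python
def hex_text_to_ints(hex_text):
    """Convert a string of hex digits (e.g. '01ff0a') to a list of byte values."""
    # bytes.fromhex raises ValueError on non-hex characters
    # or on an odd number of hex digits.
    return list(bytes.fromhex(hex_text.strip()))


print(hex_text_to_ints('404040'))
```

Wrapping each value in chr() gives characters instead of numbers, as in the codecs examples above.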
Upvotes: 2
Reputation: 109
Just an additional note to these: if you read the file in chunks inside a loop, make sure to break out once .read() returns no more data, or the loop will just keep going.
def HexView():
    with open(<yourfilehere>, 'rb') as in_file:
        while True:
            hexdata = in_file.read(16).hex()  # I like to read 16 bytes in, then start a new line.
            if len(hexdata) == 0:  # breaks loop once no more binary data is read
                break
            print(hexdata.upper())  # I also like it all in caps.
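The same chunked read can also be written with the two-argument form of iter(), which stops on its own when read() returns an empty bytes object. This is only a sketch: the function name hex_lines, the width parameter and the temporary demo file are made up for illustration.

```python
import os
import tempfile


def hex_lines(path, width=16):
    """Yield the file's contents as upper-case hex, `width` bytes per line."""
    with open(path, 'rb') as in_file:
        # iter(callable, sentinel) stops as soon as read() returns b'' (end of file).
        for chunk in iter(lambda: in_file.read(width), b''):
            yield chunk.hex().upper()


# Demo on a small temporary file (20 bytes, so two output lines).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(20)))
demo_lines = list(hex_lines(tmp.name))
os.remove(tmp.name)
print('\n'.join(demo_lines))
```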
Upvotes: 3
Reputation: 155363
Simple solution is binascii:
import binascii
# Open in binary mode (so you don't read two byte line endings on Windows as one byte)
# and use with statement (always do this to avoid leaked file descriptors, unflushed files)
with open('data.geno', 'rb') as f:
    # Slurp the whole file and efficiently convert it to hex all at once
    hexdata = binascii.hexlify(f.read())
This just gets you a str of the hex values, but it does it much faster than what you're trying to do. If you really want a bunch of length 2 strings of the hex for each byte, you can convert the result easily:
hexlist = map(''.join, zip(hexdata[::2], hexdata[1::2]))
which will produce the list of len 2 strs corresponding to the hex encoding of each byte. To avoid temporary copies of hexdata, you can use a similar but slightly less intuitive approach that avoids slicing by using the same iterator twice with zip:
hexlist = map(''.join, zip(*[iter(hexdata)]*2))
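The idiom works because both arguments to zip are the same iterator, so each tuple consumes two consecutive characters. A sketch for Python 3, where binascii.hexlify returns bytes and map is lazy; the decode step and the list comprehension are adaptations for Python 3, not part of the code above, and the sample bytes stand in for f.read():

```python
import binascii

raw = b'\x01\xff\x0a\xa1'                        # stand-in for f.read()
hexdata = binascii.hexlify(raw).decode('ascii')  # '01ff0aa1'

it = iter(hexdata)
# zip pulls from the same iterator twice per tuple: ('0', '1'), ('f', 'f'), ...
hexlist = [a + b for a, b in zip(it, it)]
print(hexlist)
```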
Update:
For people on Python 3.5 and higher, bytes objects spawned a .hex() method, so no module is required to convert from raw binary data to ASCII hex. The block of code at the top can be simplified to just:
with open('data.geno', 'rb') as f:
    hexdata = f.read().hex()
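If the per-byte list is still wanted on Python 3, note that iterating over a bytes object yields integers, so each one can be formatted directly. A small sketch; the sample bytes are made up and stand in for f.read():

```python
data = b'\x01\xff\x0a\x1a'   # stand-in for f.read(); includes a 0x1a byte
# '{:02x}' formats each integer as two lower-case hex digits with zero padding.
hexlist = ['{:02x}'.format(byte) for byte in data]
print(hexlist)
```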
Upvotes: 32
Reputation: 185
Thanks for all interesting answers!
The simple solution that worked immediately was to change "r" to "rb", so:
f=open('data.geno','r') # doesn't work
f=open('data.geno','rb') # works fine
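A likely explanation, assuming the script ran under Python 2 on Windows: in text mode the C runtime treats byte 0x1a (Ctrl-Z) as end of file, so reading stops there, while binary mode passes every byte through. A small sketch that writes a test file with 0x1a as byte number 905 and reads it back in binary mode (the file name and contents are made up):

```python
import os
import tempfile

# 904 filler bytes, then 0x1a as byte number 905, then more data.
payload = b'\x01' * 904 + b'\x1a' + b'\xff' * 95

path = os.path.join(tempfile.mkdtemp(), 'demo.geno')
with open(path, 'wb') as f:
    f.write(payload)

with open(path, 'rb') as f:      # binary mode: no end-of-file marker handling
    data = f.read()

print(len(data), data == payload)
```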
The data in this case actually uses only two bits per value, so one byte contains four values, each one of binary 00, 01, 10 or 11.
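Unpacking four 2-bit values from each byte can be sketched as below. The bit order (most significant pair first) is an assumption, the real file format may store them the other way round, and the function name unpack_2bit is made up:

```python
def unpack_2bit(data):
    """Split each byte into four 2-bit values, most significant pair first."""
    values = []
    for byte in data:                 # iterating bytes yields ints in Python 3
        for shift in (6, 4, 2, 0):
            values.append((byte >> shift) & 0b11)
    return values


print(unpack_2bit(b'\x1b'))           # 0x1b is 00 01 10 11 in binary
```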
Yours!
Upvotes: 0