Reputation: 109
I am finding substring in compressed file using following python script. I am getting "TypeError: a bytes-like object is required, not 'str'". Please any one help me in fixing this.
from re import *
import re
import gzip
import sys
import io
import os
seq={}
with open(sys.argv[1],'r') as fh:
for line1 in fh:
a=line1.split("\t")
seq[a[0]]=a[1]
abcd="AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
print(a[0],"\t",seq[a[0]])
count={}
with gzip.open(sys.argv[2]) as gz_file:
with io.BufferedReader(gz_file) as f:
for line in f:
for b in seq:
if abcd in line:
count[b] +=1
for c in count:
print(c,"\t",count[c])
fh.close()
gz_file.close()
f.close()
and input files are
TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG
the second file is compressed text file. The line "if abcd in line:" shows the error.
Upvotes: 0
Views: 5272
Reputation: 110311
The "BufferedReader" class gives you bytestrings, not text strings - you can directly compare both objects in Python3 -
Since these strings just use a few ASCII characters and are not actually text, you can work all the way along with byte strings for your code.
So, whenever you "open" a file (not gzip.open), open it in binary mode (i.e.
open(sys.argv[1],'rb')
instead of 'r'
to open the file)
And also prefix your hardcoded string with a b
so that Python uses a binary string inernally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
- this will avoid a similar error on your if abcd in line
- though the error message should be different than the one you presented.
Alternativally, use everything as text - this can give you more methods to work with the strings (Python3's byte strigns are somewhat crippled) presentation of data when printing, and should not be much slower - in that case, instead of the changes suggested above, include an extra line to decode the line fetched from your data-file:
with io.BufferedReader(gz_file) as f:
for line in f:
line = line.decode("latin1")
for b in seq:
(Besides the error, your progam logic seens to be a bit faulty, as you don't actually use a variable string in your innermost comparison - just the fixed bcd
value - but I suppose you can fix taht once you get rid of the errors)
Upvotes: 1