Dadu

Reputation: 109

TypeError: a bytes-like object is required, not 'str' when searching a compressed file

I am searching for a substring in a compressed file using the following Python script, and I am getting "TypeError: a bytes-like object is required, not 'str'". Can anyone help me fix this?

from re import *
import re
import gzip
import sys
import io
import os

seq={}
with open(sys.argv[1],'r') as fh:
    for line1 in fh:
        a=line1.split("\t")
        seq[a[0]]=a[1]
        abcd="AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"
        print(a[0],"\t",seq[a[0]])

count={}
with gzip.open(sys.argv[2]) as gz_file:
    with io.BufferedReader(gz_file) as f:
        for line in f:
            for b in seq:
                if abcd in line:
                    count[b] +=1

for c in count:
    print(c,"\t",count[c])

fh.close()
gz_file.close()
f.close()

The first input file contains:

TruSeq2_SE AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG

The second file is a compressed text file. The line "if abcd in line:" raises the error.

Upvotes: 0

Views: 5272

Answers (1)

jsbueno

Reputation: 110311

The "BufferedReader" class gives you byte strings, not text strings, and in Python 3 you cannot directly compare or mix the two kinds of object.
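To make the mismatch concrete, here is a minimal sketch that reproduces the failure outside of any file handling. The sample line is made up for illustration; only the adapter string comes from the question.

```python
# A line as it would arrive from a binary stream (hypothetical sample data).
line = b"NNNNAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGNNNN\n"
adapter = "AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"  # a text (str) needle

try:
    adapter in line              # str searched for in bytes -> TypeError
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Converting either side so both operands share a type makes the test work.
found_as_bytes = adapter.encode("ascii") in line
found_as_text = adapter in line.decode("ascii")
print(found_as_bytes, found_as_text)
```

Both conversions give the same answer here because the data is plain ASCII.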

Since these strings use only a few ASCII characters and are not really text, you can work with byte strings throughout your code.

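So, whenever you "open" a file (not gzip.open), open it in binary mode, i.e. open(sys.argv[1],'rb') instead of 'r'. The lines you read are then byte strings as well, so the separator passed to split must also be bytes: line1.split(b"\t").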

Also prefix your hardcoded string with a b so that Python uses a byte string internally: abcd=b"AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG". This avoids the TypeError on your if abcd in line, which is raised because a str is being searched for in a bytes line.
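As a rough sketch of the bytes-only route, the function below reads both files in binary mode and compares bytes with bytes. The function name count_adapters_bytes, the temporary file names and the sample reads are all hypothetical. Unlike the script in the question, it compares each sequence loaded from the first file rather than the fixed abcd value, and it uses a Counter so that a missing key does not raise KeyError.

```python
import gzip
import os
import tempfile
from collections import Counter


def count_adapters_bytes(adapter_path, gz_path):
    """Count lines of a gzip file that contain each adapter, using bytes only."""
    adapters = {}
    with open(adapter_path, "rb") as fh:               # binary mode -> bytes lines
        for raw in fh:
            fields = raw.rstrip(b"\r\n").split(b"\t")  # separator must be bytes too
            if len(fields) >= 2:
                adapters[fields[0]] = fields[1]

    counts = Counter()
    with gzip.open(gz_path, "rb") as gz_file:          # "rb" is gzip.open's default
        for line in gz_file:
            for name, sequence in adapters.items():
                if sequence in line:                   # bytes in bytes: no TypeError
                    counts[name] += 1
    return counts


# Demo with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    adapter_path = os.path.join(tmp, "adapters.tsv")
    gz_path = os.path.join(tmp, "reads.txt.gz")
    with open(adapter_path, "wb") as out:
        out.write(b"TruSeq2_SE\tAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG\n")
    with gzip.open(gz_path, "wb") as out:
        out.write(b"ACGTAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAC\nACGTACGT\n")
    byte_counts = count_adapters_bytes(adapter_path, gz_path)

print(byte_counts)
```

Note that the keys of the result are bytes as well, so they print with a b prefix unless you decode them first.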

Alternatively, treat everything as text. This gives you more methods for working with the strings (Python 3's byte strings are somewhat limited) and nicer presentation of the data when printing, and it should not be much slower. In that case, instead of the changes suggested above, add an extra line to decode each line fetched from your data file:

with io.BufferedReader(gz_file) as f:
    for line in f:
        line = line.decode("latin1")  # bytes -> str, so str comparisons work
        for b in seq:
            ...  # rest of the loop body unchanged

(Besides the error, your program logic seems to be a bit faulty, as you don't actually use a variable string in your innermost comparison, just the fixed abcd value, but I suppose you can fix that once you get rid of the errors.)
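For the text route, one possible sketch is to let gzip.open do the decoding by opening the file in "rt" mode, which is an alternative to calling decode on every line. The function name count_adapters_text, the second adapter entry and the sample reads are hypothetical. The sketch also assumes the intent was to count matching lines per sequence from the first file, so it compares each loaded sequence and starts every counter at zero.

```python
import gzip
import os
import tempfile


def count_adapters_text(adapter_path, gz_path, encoding="latin1"):
    """Count lines of a gzip file that contain each adapter, using str only."""
    adapters = {}
    with open(adapter_path, "r", encoding=encoding) as fh:
        for row in fh:
            fields = row.rstrip("\r\n").split("\t")
            if len(fields) >= 2:
                adapters[fields[0]] = fields[1]

    counts = {name: 0 for name in adapters}    # initialise before incrementing
    # "rt" makes gzip.open return decoded str lines instead of bytes.
    with gzip.open(gz_path, "rt", encoding=encoding) as gz_file:
        for line in gz_file:
            for name, sequence in adapters.items():
                if sequence in line:           # str in str
                    counts[name] += 1
    return counts


# Demo with throwaway files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    adapter_path = os.path.join(tmp, "adapters.tsv")
    gz_path = os.path.join(tmp, "reads.txt.gz")
    with open(adapter_path, "w", encoding="latin1") as out:
        out.write("TruSeq2_SE\tAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG\n")
        out.write("Other\tCCCCCCCCCC\n")
    with gzip.open(gz_path, "wt", encoding="latin1") as out:
        out.write("ACGTAGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGAC\n")
        out.write("TTTT\n")
        out.write("AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTGGG\n")
    text_counts = count_adapters_text(adapter_path, gz_path)

for name, total in text_counts.items():
    print(name, total, sep="\t")
```

Stripping the trailing newline from each row of the first file matters here: without it, the stored sequence would end in "\n" and would only match when the adapter sits at the very end of a line.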

Upvotes: 1
