Reputation: 23
I have sequence files and barcode files. The barcode files may have barcodes of any length that look like "ATTG, AGCT, ACGT" for example. The sequence files look like "ATTGCCCCCCCGGGGG, ATTGTTTTTTTT, AGCTAAAAA" for example. I need to match the barcodes to the sequences that contain them at the beginning. Then for each set of sequences with the same barcode I have to do calculations on them with the rest of the program (which is written already). I just dont know how to get them to match. Ive went through using print statements and The part where it is messed up is the "potential_barcode = line(:len(barcode)" line. Also, where it says #simple to fasta that is where I should be reading in the matched sequences. I'm pretty new at this so I probably made a lot of mistakes. Thanks for your help!
bcodefname = sys.argv[1]
infname = sys.argv[2]
barcodefile = open(bcodefname, "r")
for barcode in barcodefile:
barcode = barcode.strip()
print "barcode: %s" % barcode
outfname = "%s.%s" % (bcodefname,barcode)
# print outfname
outf = open("outfname", "w")
handle = open(infname, "r")
for line in handle:
potential_barcode = line[:len(barcode)]
print potential_barcode
if potential_barcode == barcode:
outseq = line[len(barcode):]
sys.stdout.write(outseq)
outf.write(outseq)
fastafname = infname + ".fasta"
print fastafname
mafftfname = fastafname + ".mafft"
stfname = mafftfname + ".stock"
print stfname
#simp to fasta#
# handle = open(infname, "r")
outf2 = open(fastafname, "w")
for line in handle:
linearr = line.split()
seqid = linearr[0]
seq = linearr[1]
outf2.write(">%s\n%s\n" % (seqid,seq))
# handle.close()
# outf.close()
#mafft#
cmd = "mafft %s > %s" % (fastafname,mafftfname)
sys.stderr.write("command: %s\n" % cmd)
os.system(cmd)
sys.stderr.write("command done\n")
Upvotes: 0
Views: 690
Reputation: 3137
I don't know exactly why this code isn't working for you. But here are some tips for improving the code:
You're reading the entire sequence file for every barcode. If there are 100 barcodes, you're reading through the sequence file 100 times. What you should do instead is read the barcode file once and create a list of barcodes.
First define a function we'll use to check for matches:
def matches_barcode(sequence, barcodes):
for bc in barcodes:
if sequence.startswith(bc):
return True
return False
(Note that I used startswith
instead of constructing a new string and comparing it; startswith
should be faster.)
Now read barcodes from the file:
barcodes = []
with open(bcodefname, "r") as barcodefile:
for barcode in barcodefile:
barcodes.append(barcode.strip())
(Note that I used with open...
; Your code leaks open files which could prevent the program from working if you have a lot of barcodes.)
Then read through the sequence file once, checking whether each sequence matches a barcode:
with open(infname, "r") as seqfile:
for sequence in seqfile:
if matches_barcode(sequence.strip(), barcodes):
# Got a match, continue processing.
This will be a lot faster because it does much less I/O: It reads 2 files instead of N + 1 files, where N is the number of barcodes. It's still a pretty naive algorithm, though: If it's too slow, you'll have to investigate more sophisticated algorithms for checking for a match.
If you still don't get the matches you expect, you'll need to debug: print out the exact strings being compared so you can see what's happening. It's a good idea to use repr
in these situations to really see the data, including whitespace and everything.
Upvotes: 1