Reputation: 1096
Please give your hints - I have spent a lot of time on this code but have not been able to solve it.
I have a code that gives output in a format I want however the results it shows are not 100% correct.
I have two big FASTA files that look like below (information after # are not in the actual file - I put them in this question for better understanding):
f1.fa
>seq1
ATCGCGT #--> 6mers are ATCGCG and TCGCGT // reverse complements are CGCGAT and ACGCGA
>seq20
CCATGAA #--> 6mers are CCATGA and CATGAA // reverse complements are TCATGG and TTCATG
>seq3
AACATGA #--> 6mers are AACATG and ACATGA // reverse complements are CATGTT and TCATGT
f2.fa
>seq9
NNCGCGATNNCGCGATNN
>seq85
NNTTCATG
what I want to do is:
file1
file2
to see if any of the reverse complement 6mers are inside then mark those sequences against each other with 1 else with 0so based on the example sequences above - one of the reverse complement 6mers from seq1 (CGCGAT
) is inside seq9 of file2 - also one of the reverse complement 6mers from seq20 (TTCATG
) is inside seq85 of file2.
so the desired output should be as below:
seq1 seq20 seq3
name
seq9 1 0 0
seq85 0 1 0
but currently, I get:
seq1 seq20 seq3
name
seq9 0 0 0
seq85 0 1 0
My code is:
from Bio import SeqIO
import pandas as pd
def same_seq(a_record, brecord):
window_size = 7
for j in range(0, len(a_record.seq) - window_size + 1):
for k in (a_record.seq.reverse_complement()[j: j + 6].split()):
return brecord.seq.find(k) != -1
if __name__ == '__main__':
records = list(SeqIO.parse("f1.fa", "fasta"))
target_records = list(SeqIO.parse("f2.fa", "fasta"))
rows_list = []
for target_record in target_records:
new_row = {'name': target_record.name}
for record in records:
if same_seq(record, target_record):
new_row[record.name] = 1
else:
new_row[record.name] = 0
rows_list.append(new_row)
df = pd.DataFrame(rows_list)
df = df.set_index(["name"])
print(df)
I am not sure if the first part or the second part of the code is causing the problem. It would be great if you could help me to fix my code. Thank you
Upvotes: 1
Views: 176
Reputation: 892
It seems there is a bug in your code. Change the same_seq
function as follows:
def same_seq(a_record, brecord):
window_size=6
for j in range(len(a_record.seq)- window_size):
for k in (a_record.seq.reverse_complement()[j: j + 6].split()):
if brecord.seq.find(k) != -1:
return 1
return 0
Hope it works!
Upvotes: 2