Apex
Apex

Reputation: 1096

How to intersect two FASTA files using Python

Please give your hints - I have spent a lot of time on this code but have not been able to solve it.

I have a code that gives output in a format I want however the results it shows are not 100% correct.

I have two big FASTA files that look like below (information after # are not in the actual file - I put them in this question for better understanding):

f1.fa

>seq1
ATCGCGT  #--> 6mers are ATCGCG and TCGCGT // reverse complements are CGCGAT and ACGCGA
>seq20
CCATGAA  #--> 6mers are CCATGA and CATGAA // reverse complements are TCATGG and TTCATG
>seq3
AACATGA  #--> 6mers are AACATG and ACATGA // reverse complements are CATGTT and TCATGT

f2.fa

>seq9
NNCGCGATNNCGCGATNN
>seq85
NNTTCATG

what I want to do is:

  1. to read 6mers (6 character windows) of reach sequence from file1
  2. make reverse complements of 6mers (this is not the problem)
  3. look for any sequence (these are target sequences) in file2 to see if any of the reverse complement 6mers are inside then mark those sequences against each other with 1 else with 0
  4. report the result as a table.

so based on the example sequences above - one of the reverse complement 6mers from seq1 (CGCGAT) is inside seq9 of file2 - also one of the reverse complement 6mers from seq20 (TTCATG) is inside seq85 of file2.

so the desired output should be as below:

         seq1  seq20  seq3
name                    
seq9      1      0     0
seq85     0      1     0

but currently, I get:

         seq1  seq20  seq3
name                    
seq9      0      0     0
seq85     0      1     0

My code is:

from Bio import SeqIO
import pandas as pd


def same_seq(a_record, brecord):
    window_size = 7
    for j in range(0, len(a_record.seq) - window_size + 1):  
        for k in (a_record.seq.reverse_complement()[j: j + 6].split()):  
            return brecord.seq.find(k) != -1


if __name__ == '__main__':
    records = list(SeqIO.parse("f1.fa", "fasta"))
    target_records = list(SeqIO.parse("f2.fa", "fasta"))
    rows_list = []
    for target_record in target_records:
        new_row = {'name': target_record.name}
        for record in records:
            if same_seq(record, target_record):
                new_row[record.name] = 1
            else:
                new_row[record.name] = 0
        rows_list.append(new_row)
    df = pd.DataFrame(rows_list)
    df = df.set_index(["name"])
    print(df)

I am not sure if the first part or the second part of the code is causing the problem. It would be great if you could help me to fix my code. Thank you

Upvotes: 1

Views: 176

Answers (1)

Hamid Ghasemi
Hamid Ghasemi

Reputation: 892

It seems there is a bug in your code. Change the same_seq function as follows:

def same_seq(a_record, brecord):
    window_size=6
    for j in range(len(a_record.seq)- window_size):
        for k in (a_record.seq.reverse_complement()[j: j + 6].split()):
            if brecord.seq.find(k) != -1:
                    return 1
    return 0

Hope it works!

Upvotes: 2

Related Questions