chippycentra
chippycentra

Reputation: 3432

Split a FASTA alignment file into all possible alignment pair in Python

I have an alignment FASTA file:

>SEQ1
AGGGGGCGGATAGGCAG-C--AGGCGCGGAGAGCGGCAGCG---GG---GGACG-AC-
AGGAACACCTATACTTcCTTCTAC-CAGACGAAC-------CGAGACa
>SEQ2
TGGGGGCGGATGGGCAG----AGGCGCGGAGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTAHACCCcCTTCTAC-CAGACGAAC-------CGACACa
>SEQ3
TGGGGGCGGATGGGCAG-T--AGGCGCGGATAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTATACCCcCTTCTAC-CAGACGAAC-------CGACACa
>SEQ4
TGGGGGCGGATGGGCAG-A--AGGCGCGGTGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTACACCCcCTTCTAC-CAGACGAAC-------CGACACa
>SEQ5
TGGGGGCGGATGGGCAG-G--AGGCGCGGAGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTATACCCcCTTCTAC-CAGACGAAC-------CGACACa

and I would like to open this file, and create all the possible pair between SEQn

(without the same sequence in a pair)

for instance if there are 5 SEQ, then there will be n(n-1)/2 5(5-1)/2 = 10 groups so

SEQ1 vs SEQ2
SEQ1 vs SEQ3
SEQ1 vs SEQ4
SEQ1 vs SEQ5
SEQ2 vs SEQ3
SEQ2 vs SEQ4
SEQ2 vs SEQ5
SEQ3 vs SEQ4
SEQ3 vs SEQ5
SEQ4 vs SEQ5

and I want to create 10 new FASTA alignment file:

SEQ1 _vs_SEQ2.fa

>SEQ1
AGGGGGCGGATAGGCAG-C--AGGCGCGGAGAGCGGCAGCG---GG---GGACG-AC-
AGGAACACCTATACTTcCTTCTAC-CAGACGAAC-------CGAGACa
>SEQ2
TGGGGGCGGATGGGCAG----AGGCGCGGAGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTAHACCCcCTTCTAC-CAGACGAAC-------CGACACa

.
.
.

SEQ4 _vs_SEQ5.fa

>SEQ4
TGGGGGCGGATGGGCAG-A--AGGCGCGGTGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTACACCCcCTTCTAC-CAGACGAAC-------CGACACa
>SEQ5
TGGGGGCGGATGGGCAG-G--AGGCGCGGAGAGCGGCAGCG-A-GG---GGACG-AC-
AGGAACATCTATACCCcCTTCTAC-CAGACGAAC-------CGACACa

Does someone have a fast idea (I have more then 122 taxa within each group which mean that I need to create 7381 files and 200 times)?

SO far I used:

records=SeqIO.to_dict(SeqIO.parse("mysequence.fasta", "fasta"))
list_taxa=list(records)
#Create all possible pairs 
List_pairs=[(list_taxa[i],list_taxa[j]) for i in range(len(list_taxa)) for j in range(i+1, len(list_taxa))]

count=0
SEQ_NAME="Name_seq"
for i in List_pairs:
  new_pair_name=i[0]+"_"+i[1]+"_"+SEQ_NAME
  output_file=open("/My_dir_where_to_create_the_files/"+new_pair_name+".fa","w")
  for a in i: 
    print(">",records[a].id,sep="",file=output_file)
    print(records[a].seq,file=output_file)
  print(int(count),"/",len(list(records)))
  count+=1

but it is not so efficient for the part where I create the pairs alignment files...

Upvotes: 0

Views: 154

Answers (1)

Hunted
Hunted

Reputation: 88

Assuming all the files are in the same directory, this should be sufficient to have a list of all of the combos possible.

import os

ListofFiles = []
for file in os.path(WhateverYourDirectoryContainingTheFilesIs):
    if file.startswith('SEQ'):
        ListofFiles.append(file)
        
length = len(ListofFiles)
FileCombos = []

for i in range(length):
    for j in range(i+1, length):
        FileCombos.append(ListofFiles[i] + " vs " + ListofFiles[j])

From here, it shouldn't be too difficult to create your files as you need them, in your preferred method. I may come back to this tomorrow for a more pythonic solution, if no one else cooks one up before then.

Upvotes: 1

Related Questions