Reputation: 129
First, I apologize for my pythonic ignorance. I need to break up my multi-sequence fasta file into equal sized chunks for a pipeline downstream. I haven't run across anything that does this easily or in the format I'm looking for.
An example fasta file input:
ACGTA
>contig2GGGATAGTCA
>contig3GACTACTTTT
The above example fasta has 25bp. If I set the "chunk number" parameter to '4', then I would expect my output files to all have 7 base pairs, except for the last file which would have the remainder 4bp. It would look like this:
ACGTA
>contig2:0-1GG
GATAGTC
A
>contig3:0-5GACTAC
TTTT
Notice each resulting chunk*.fas includes 7 base pairs, except for the leftover base pairs in chunk4.fas. Also, each resulting sequence header in the chunk files have changed from the original, so that they include a ":" as well as a "start" and "stop" location derived from the original sequence.
The biopython cookbook has a pretty nice batch size iterator tool and I think my answer resides within manipulating this code, but I haven't a clue how to go about this.
Any help is appreciated. Cheers.
def batch_iterator(iterator, batch_size):
"""Returns lists of length batch_size.
This can be used on any iterator, for example to batch up
SeqRecord objects from Bio.SeqIO.parse(...), or to batch
Alignment objects from Bio.AlignIO.parse(...), or simply
lines from a file handle.
This is a generator function, and it returns lists of the
entries from the supplied iterator. Each list will have
batch_size entries, although the final list may be shorter.
"""
entry = True # Make sure we loop once
while entry:
batch = []
while len(batch) < batch_size:
try:
entry = next(iterator)
except StopIteration:
entry = False
if not entry:
# End of file
break
batch.append(entry)
if batch:
yield batch
record_iter = SeqIO.parse('aVan.fa', 'fasta')
for i, batch in enumerate(batch_iterator(record_iter, 1000), start=1):
filename = 'group_{}.fasta'.format(i)
count = SeqIO.write(batch, filename, 'fasta')
print('Wrote {} records to {}'.format(count, filename))
Upvotes: 1
Views: 2479
Reputation: 2845
That is not an easy task, but have a look at this implementation:
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
chunk_number = 4
records = list(SeqIO.parse("input.fasta", "fasta"))
chunk_size = sum(len(r) for r in records) // chunk_number + 1
def create_batch(records, chunk_size):
record_it = iter(records)
record = next(record_it)
current_base = 0
batch = []
batch_size = 0
# While there are new records, keep creating new batches.
while record:
# Loop over records untill the batch is full. (or no new records)
while batch_size != chunk_size and record:
end = current_base + chunk_size - batch_size
seq = record[current_base:end]
end_of_slice = current_base + len(seq) - 1
fasta_header = record.id + ":{}-{}".format(current_base, end_of_slice)
seq.id = seq.name = fasta_header
seq.description = ''
batch.append(seq)
current_base += len(seq)
batch_size += len(seq)
# Current record is exhausted, get a new one.
if current_base >= len(record):
record = next(record_it, None)
current_base = 0
# We have a batch with the correct size (or no new bathces)
yield batch
batch = []
batch_size = 0
for i, batch in enumerate(create_batch(records, chunk_size)):
filename = "chunk{}.fasta".format(i)
SeqIO.write(batch, filename, "fasta")
Upvotes: 2