rhkss

Reputation: 55

Split a file into many separate files using Python

I have a file that I want to split into many separate files, and I want to do it with Python.

For example, the data in my file looks like this:

>2165320 21411 200802 8894-,...,765644-
TTCGGAGCTTACTAATTTTAAATATGAAGAATGCCAATATAAGTTTTGATTTCGAAAATACTTTTTTACTAGTTAAAAATTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTAAGGTGGTGTA

>2165799 14641 135356 16580+,...,680341-
AAGGTAGGAGGTACTCGTGCTAATGGAGGAGCTAATGGTACACCAAACCGACGGCTGTCACTTAATGCTCATCAAAACGGAAGCAGGTCCACAACAAAAGATGGAAAAAAAGACATCAGACCAGTTGCTCCTGTGAATTATGTGGCCATATCAAAAGAAGATGCTGCTTCCCATGTTTCTGGTACCGAACCAATCCCGGCATCACCCTAATAATGAGATCTTCATTATCAACCCTACAATTTCATCTTTGTAGCATGATCAAATACTAGTTACTGCTTTAGGAATTATAATATGGAGTGACAAGTAATTAGAGAGGAACTGTTTTGAGCTGTGTATGTTCAATTTGCCATTTGGAGGTTTTCTCAATACATGTGCCCTTTAATATGAAAATATAGTGCTATTCTTGCCTTTCTCCAAACCCTGGCTCCTCCTATTCATCGGTTTCTT

>2169677 23891 1928391 1298391,…..,739483-
CTAGCTGATCGAGCTGATCGTAGTGAGCTATCGAGCTGACTACTAGCTAGTCGTGATAGCTGATCGAGCTGACTGATGTGCTAGTAGTAGTTTCATGATTTTCTACATCTATAACAATTTGTGTTTTTTTTAAACATCTTCCAGTGTCCTAAGTGTATATTTTTTAACGCAATGTTTGAATACTTTTAGGGTTTACCTTATTTAATTTGATTTTTAATGTGAGTTGTAATCACTGGTGAGCATACTGTTTTTCTTTTGTTCAGTAATATTGCATTTGTAGCTTTTGTATTGCTTAGATATATCACATTAAATCCTTTGTTCAGAAACCCATCCGACAGGGAGTCATAGGTGCCACACTAGTGGTCGAGGATCTAGGATGTCGGAAGGTCAACAATGGGGTAAAACACTAATTTTTTAATTTCTTGTATTTACCAAATTTACTGATTTTGCATTTAGTAGATGGTATATATACTCTTCTACCTTGTACAGTTGATGGTACCTGACTAAATATGTTTTATTTCCTTCTCCAGGATCTTTATGTAGTACGATTCTACAGTCGTCAAGAGGAGGGTAGAAAAGGAGAAGTAAGTTATAATATTTCTGAGCTTTTTTCTTTTTAATTGTTGTTGATAGAAAGTTGTGCCATATACATGTTTTA

And so on.

So I want to split the file at each '>' sign, taking everything up to the next '>', and store each piece in a separate file.

For example, the first file will contain this data:

>2165320 21411 200802 8894-,...,765644-
TTCG…..GTA

The second file will contain this data:

>2165799 14641 135356 16580+,...,680341-
AAGG….GTTTCTT

And so on.

Upvotes: 2

Views: 192

Answers (3)

jdi

Reputation: 92559

It seems your data is just newline-separated, so all you need to do is loop over the lines and write each non-blank one to its own sequentially numbered file:

with open("source.txt") as f:
    counter = 1
    for line in f:
        if not line.strip():
            continue
        with open("out_%03d.txt" % counter, 'w') as out:
            out.write(line)
        counter += 1

This assumes that each group is really one long line (the real format wasn't clear to me).

Because you haven't given us much explanation about the real format of this file, here is another option in case those really are newline characters between lines that should be in the same file. If ">" is a solid indicator of a new group, we can just use it to indicate a new file:

with open("source.txt") as f:
    counter = 1
    out = None

    for line in f:
        # A line starting with ">" begins a new group, so start a new file
        if line.lstrip().startswith(">"):
            if out is not None:
                out.close()
            out_name = "out_%03d.txt" % counter
            counter += 1
            out = open(out_name, 'w')

        # Ignore anything that appears before the first ">" line
        if out is not None:
            out.write(line)

    if out is not None:
        out.close()
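If it helps, the same header-based split can be factored into a small generator, which makes it easier to check the grouping before any files are written. This is only a sketch under the assumption that every record starts with a ">" line and that blank lines are just separators; the names split_fasta_records and write_records are made up for illustration.

```python
def split_fasta_records(lines):
    """Yield one list of lines per record; a record starts at a '>' header."""
    record = []
    for line in lines:
        # A new header closes the record collected so far (if any)
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # drop blank separator lines
            record.append(line)
    if record:
        yield record


def write_records(lines, pattern="out_%03d.txt"):
    """Write each record to its own numbered file and return the file names."""
    names = []
    for number, record in enumerate(split_fasta_records(lines), start=1):
        name = pattern % number
        with open(name, "w") as out:
            out.writelines(record)
        names.append(name)
    return names
```

Calling write_records(f) on an open file object would then produce out_001.txt, out_002.txt and so on, with a multi-line sequence kept together with its header.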

Upvotes: 1

09dzxue

Reputation: 27

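with open("source.txt") as f:
    counter = 1
    # Count input lines separately so that skipping a line cannot stall the loop
    for lineno, line in enumerate(f, start=1):
        if lineno % 3 == 0:
            # Every third line is the blank separator; skip it
            continue
        with open("out_%03d.txt" % counter, 'w') as out:
            out.write(line)
        counter += 1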

Upvotes: 0

jfs

Reputation: 414129

To write each blank-line-separated group of lines to a separate file you could use itertools.groupby():

#!/usr/bin/env python
import sys
from itertools import groupby

def blank(line, mark=[0]):
    # mark is a deliberately mutable default: it keeps state between calls
    if not line.strip():  # blank line
        mark[0] ^= 1  # flip the key to mark the start of a new group
    return mark[0]

for i, (_, group) in enumerate(groupby(sys.stdin, blank), start=1):
    with open("group-%03d.txt" % (i,), "w") as outfile:
        outfile.writelines(group)

Usage:

$ python split-on-blank.py < input_file.txt
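A possible variant of the groupby() approach that avoids the mutable default argument is to group lines by whether they are blank and keep only the non-blank runs. This is a sketch, not a drop-in replacement: unlike the version above, it drops the blank lines instead of writing them into the output, and nonblank_groups is a hypothetical helper name.

```python
from itertools import groupby


def nonblank_groups(lines):
    """Yield each run of consecutive non-blank lines as a list."""
    # The key is True for blank lines and False for lines with content,
    # so groupby() alternates between separator runs and record runs.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield list(group)
```

It can be fed sys.stdin or an open file, and each yielded list can be passed to outfile.writelines() in the same enumerate() loop as above.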

If you work with such formats often, consider using a proper parser, such as the one provided by the Bio.SeqIO.parse() function from biopython.

Upvotes: 1
