Sayantan Ghosh

Reputation: 338

How to convert a generated text file to a tsv data form through python?

So I have a text file of the following format:

Splice site predictions for 46 sequences with donor score cutoff 0.40, acceptor score cutoff 0.40 (exon/intron boundary shown in larger font):
Acceptor site predictions for MFSD8 :
Start   End    Score     Intron               Exon
1       41     0.96      atttgtgtttttctttttaagagaacatcgtgtggatgact

Donor site predictions for CEP290 :

Start   End    Score     Exon   Intron
1       15     0.72      acttcaggtatactc

The file contains many more entries in this same repeating format. I want to convert it into a TSV file of the following format:

id        type         score    seq
MFSD8     acceptor     0.96     atttgtgtttttctttttaagagaacatcgtgtggatgact
CEP290    donor        0.72     acttcaggtatactc

So, the word Acceptor or Donor at the start of the header line becomes the 'type', the id is the name written after "for", and the score and seq are taken from the corresponding columns of the data row. I am trying the following code:

for line in file.readlines():
    ENTRY = copy.deepcopy(ENTRY_T)
    site_type, gene, score, seq = ['','','','']

    ENTRY['predictor'] = 'NNSplice'
    line_elements = line.split()
    if line_elements[0] == 'Donor' or line_elements[0] == 'Acceptor':
        ENTRY['splice_site_type'] = line_elements[0]
        ENTRY['Gene_name'] = line_elements[-2]
    elif line_elements[0] == 'Start':
        pass
    elif int(line_elements[0]).isdigit():
        ENTRY['sequence'] = line_elements[-1]
        ENTRY['score'] = line_elements[-2]
    elif line_elements == []:
        pass

    field_values = [ENTRY[i] for i in HEADERS]
    print(field_values)

Any suggestions as to how to do this?

Upvotes: 0

Views: 2450

Answers (1)

Martin Evans

Reputation: 46759

You will need to parse the file a line at a time to extract the bits you need. The following code shows one possible approach.

Firstly it reads the text file a line at a time. It spots if it is an acceptor or donor section. If so it extracts the ID from the end of the line for later use.

It then uses Python's CSV reader to split the line on spaces into its elements. If there are exactly 4 elements, it assumes this is a valid data row and uses those elements to construct a suitable output row.
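To see why the four-element check separates data rows from column headers, here is a small sketch that applies the same csv.reader call to two of the sample lines from the question. The helper name split_fields is hypothetical and used only for illustration.

```python
import csv
from io import StringIO

def split_fields(line):
    """Split one line on runs of spaces, the same way the csv.reader call does."""
    # skipinitialspace=True ignores the spaces that follow each delimiter,
    # so a run of spaces behaves like a single separator.
    reader = csv.reader(StringIO(line), delimiter=' ', skipinitialspace=True)
    return next(reader, [])

header = "Start   End    Score     Intron               Exon\n"
data = "1       41     0.96      atttgtgtttttctttttaagagaacatcgtgtggatgact\n"

print(len(split_fields(header)))  # 5 fields, so the column header is skipped
print(split_fields(data))         # 4 fields, so this counts as a data row
```

Blank lines come back as an empty list, so they also fail the length check and are ignored.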

A csv.writer() is used with a tab delimiter to write your output file.

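import csv
from io import StringIO

with open('predictions.txt') as f_predictions, open('output.tsv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(['id', 'type', 'score', 'seq'])

    # Set by each section header; named to avoid shadowing the built-ins id and type
    gene_id = site_type = None

    for line in f_predictions:
        if line.startswith('Acceptor'):
            site_type = 'acceptor'
            # Header looks like "Acceptor site predictions for MFSD8 :", so the ID is word 5
            gene_id = line.split(' ')[4]
        elif line.startswith('Donor'):
            site_type = 'donor'
            gene_id = line.split(' ')[4]
        else:
            # Split on runs of spaces; a blank line gives an empty list
            row = next(csv.reader(StringIO(line), delimiter=' ', skipinitialspace=True))

            # Only data rows have exactly four fields: start, end, score, sequence
            if len(row) == 4:
                csv_output.writerow([gene_id, site_type, row[2], row[3]])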

For your given file, this would give you a tab-separated output file as follows:

id  type    score   seq
MFSD8   acceptor    0.96    atttgtgtttttctttttaagagaacatcgtgtggatgact
CEP290  donor   0.72    acttcaggtatactc
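
As an alternative sketch (not the method used above), the same parsing can be written with str.split() and a regular expression. This version also skips any data row that appears before a section header. The function name parse_predictions and the pattern SECTION_RE are assumptions made for illustration, and the pattern assumes the gene name is followed by a space before the colon, as in the sample.

```python
import csv
import io
import re

# Assumed header shape: "Acceptor site predictions for MFSD8 :"
SECTION_RE = re.compile(r'^(Acceptor|Donor) site predictions for (\S+)')

def parse_predictions(lines):
    """Yield (id, type, score, seq) tuples from the prediction text lines."""
    gene_id = site_type = None
    for line in lines:
        match = SECTION_RE.match(line)
        if match:
            # Remember the current section until the next header line
            site_type = match.group(1).lower()
            gene_id = match.group(2)
            continue
        fields = line.split()
        # A data row has four fields and starts with a numeric position
        if gene_id is not None and len(fields) == 4 and fields[0].isdigit():
            yield gene_id, site_type, fields[2], fields[3]

sample = """Acceptor site predictions for MFSD8 :
Start   End    Score     Intron               Exon
1       41     0.96      atttgtgtttttctttttaagagaacatcgtgtggatgact

Donor site predictions for CEP290 :

Start   End    Score     Exon   Intron
1       15     0.72      acttcaggtatactc
"""

# Write to an in-memory buffer here; a real script would open a .tsv file instead
out = io.StringIO()
writer = csv.writer(out, delimiter='\t', lineterminator='\n')
writer.writerow(['id', 'type', 'score', 'seq'])
writer.writerows(parse_predictions(sample.splitlines()))
print(out.getvalue())
```

Keeping the parsing in a generator separates it from the file handling, so the same function can be fed an open file object or a list of lines.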

Upvotes: 1
