Reputation: 338
So I have a text file of the following format:
Splice site predictions for 46 sequences with donor score cutoff 0.40, acceptor score cutoff 0.40 (exon/intron boundary shown in larger font):
Acceptor site predictions for MFSD8 :
Start End Score Intron Exon
1 41 0.96 atttgtgtttttctttttaagagaacatcgtgtggatgact
Donor site predictions for CEP290 :
Start End Score Exon Intron
1 15 0.72 acttcaggtatactc
The file contains many more entries in this repeating format. I want to convert it into a TSV file of the following format:
id type score seq
MFSD8 acceptor 0.96 atttgtgtttttctttttaagagaacatcgtgtggatgact
CEP290 donor 0.72 acttcaggtatactc
So the word Acceptor or Donor on each header line becomes the 'type', the id is the name written after 'for', and the score and seq are taken from the columns under the respective headers. I am trying the following code:
for line in file.readlines():
    ENTRY = copy.deepcopy(ENTRY_T)
    site_type, gene, score, seq = ['','','','']
    ENTRY['predictor'] = 'NNSplice'
    line_elements = line.split()
    if line_elements[0] == 'Donor' or line_elements[0] == 'Acceptor':
        ENTRY['splice_site_type'] = line_elements[0]
        ENTRY['Gene_name'] = line_elements[-2]
    elif line_elements[0] == 'Start':
        pass
    elif int(line_elements[0]).isdigit():
        ENTRY['sequence'] = line_elements[-1]
        ENTRY['score'] = line_elements[-2]
    elif line_elements == []:
        pass
    field_values = [ENTRY[i] for i in HEADERS]
    print(field_values)
Any suggestions as to how to do this?
Upvotes: 0
Views: 2450
Reputation: 46759
You will need to parse the file a line at a time to extract the bits you need. The following code shows one possible approach.
First, it reads the text file a line at a time and checks whether the line starts an acceptor or donor section. If so, it extracts the ID from the end of the line for later use.
Otherwise, it uses Python's CSV reader to split the line into possible elements. If the number of elements is 4, it assumes this is a valid data row and uses the row elements to construct a suitable output row.
A csv.writer() with a tab delimiter is used to write your output file.
import csv
from io import StringIO

with open('predictions.txt') as f_predictions, open('output.tsv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(['id', 'type', 'score', 'seq'])

    for line in f_predictions:
        # A section header: remember the site type and gene ID for the rows that follow
        if line.startswith('Acceptor'):
            type = 'acceptor'
            id = line.split(' ')[4]
        elif line.startswith('Donor'):
            type = 'donor'
            id = line.split(' ')[4]
        else:
            # Split on spaces; only rows with exactly 4 fields are treated as data
            row = next(csv.reader(StringIO(line), delimiter=' ', skipinitialspace=True))

            if len(row) == 4:
                csv_output.writerow([id, type, row[2], row[3]])
For your given file, this would give you a tab-separated output file as follows:
id type score seq
MFSD8 acceptor 0.96 atttgtgtttttctttttaagagaacatcgtgtggatgact
CEP290 donor 0.72 acttcaggtatactc
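As a variation on the same line-by-line idea, here is a minimal sketch that uses str.split() instead of csv.reader and works on any iterable of lines, which makes it easy to try on an in-memory sample. The function name parse_predictions and the sample text are made up for illustration, and it assumes header lines always read "Acceptor site predictions for NAME" or "Donor site predictions for NAME" and that data rows have exactly four whitespace-separated fields starting with a number.

```python
import csv
import io


def parse_predictions(lines):
    """Yield (gene_id, site_type, score, seq) tuples from the prediction text.

    Hypothetical helper; assumes the layout shown in the question.
    """
    gene_id = site_type = None
    for line in lines:
        parts = line.split()
        # Section header, e.g. "Acceptor site predictions for MFSD8 :"
        if (len(parts) >= 5 and parts[0] in ('Acceptor', 'Donor')
                and parts[1:4] == ['site', 'predictions', 'for']):
            site_type = parts[0].lower()
            gene_id = parts[4].rstrip(':')
        # Data row, e.g. "1 41 0.96 atttgt..." (Start, End, Score, sequence)
        elif len(parts) == 4 and parts[0].isdigit() and gene_id is not None:
            yield gene_id, site_type, parts[2], parts[3]


sample = """\
Splice site predictions for 46 sequences with donor score cutoff 0.40, acceptor score cutoff 0.40 (exon/intron boundary shown in larger font):
Acceptor site predictions for MFSD8 :
Start End Score Intron Exon
1 41 0.96 atttgtgtttttctttttaagagaacatcgtgtggatgact
Donor site predictions for CEP290 :
Start End Score Exon Intron
1 15 0.72 acttcaggtatactc
"""

rows = list(parse_predictions(sample.splitlines()))

# Write to an in-memory buffer here; a real file opened with newline='' works the same way
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
writer.writerow(['id', 'type', 'score', 'seq'])
writer.writerows(rows)
print(buffer.getvalue())
```

Checking that the first field is a number means the "Start End Score ..." column header lines and the summary line at the top are skipped without special cases.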
Upvotes: 1