Reputation: 33
Below is a Python script that checks whether certain words are found in each of a list of files.
experiment=open('potentiation.txt')
lines=experiment.read().splitlines()
receptors=['crystal_1.txt', 'modeller_1.txt', 'moe_1.txt',
           'nci5_modeller0000_1.txt', 'nci5_modeller0001_1.txt',
           'nci5_modeller0002_1.txt', 'nci5_modeller0003_1.txt',
           'nci5_modeller0004_1.txt', 'nci5_modeller0005_1.txt',
           'nci5_modeller0006_1.txt', 'nci5_modeller0007_1.txt',
           'nci5_modeller0008_1.txt', 'nci5_modeller0009_1.txt',
           'nci5_modeller0010_1.txt', 'nci5_modeller0011_1.txt',
           'nci5_moe0000_1.txt', 'nci5_moe0001_1.txt', 'nci5_moe0002_1.txt',
           'nci5_moe0003_1.txt', 'nci5_moe0004_1.txt', 'nci5_moe0005_1.txt',
           'nci5_moe0006_1.txt', 'nci5_moe0007_1.txt', 'nci5_moe0008_1.txt',
           'nci5_moe0009_1.txt', 'nci5_moe0010_1.txt', 'nci5_moe0011_1.txt',
           'nci5_moe0012_1.txt', 'nci5_moe0013_1.txt', 'nci5_moe0014_1.txt']
for ligand in lines:
    for protein in receptors:
        file1=open(protein,"r")
        read1=file1.read()
        find_hit=read1.find(ligand)
        if find_hit == -1:
            print ligand,protein,"Not Found"
        else:
            print ligand,protein, "Found"
An example of the output of this code is below:
345647 nci5_moe0012_1.txt Not Found
345647 nci5_moe0013_1.txt Not Found
345647 nci5_moe0014_1.txt Found
My question is: how can I take this output and format it into a CSV file that looks like the example below?
Ligand    nci5_moe0012_1    nci5_moe0013_1    nci5_moe0014_1
345647    Not Found         Not Found         Found
Upvotes: 3
Views: 2144
Reputation: 123453
I think something like this would do it (assuming your output file is tab-delimited):
import csv

receptors = ['crystal_1', 'modeller_1', 'moe_1',
             'nci5_modeller0000_1', 'nci5_modeller0001_1',
             'nci5_modeller0002_1', 'nci5_modeller0003_1',
             'nci5_modeller0004_1', 'nci5_modeller0005_1',
             'nci5_modeller0006_1', 'nci5_modeller0007_1',
             'nci5_modeller0008_1', 'nci5_modeller0009_1',
             'nci5_modeller0010_1', 'nci5_modeller0011_1',
             'nci5_moe0000_1', 'nci5_moe0001_1', 'nci5_moe0002_1',
             'nci5_moe0003_1', 'nci5_moe0004_1', 'nci5_moe0005_1',
             'nci5_moe0006_1', 'nci5_moe0007_1', 'nci5_moe0008_1',
             'nci5_moe0009_1', 'nci5_moe0010_1', 'nci5_moe0011_1',
             'nci5_moe0012_1', 'nci5_moe0013_1', 'nci5_moe0014_1']

# 'wb' is the mode the csv module expects on Python 2; on Python 3
# use open('output.csv', 'w', newline='') instead.
with open('potentiation.txt', 'rt') as experiment, \
        open('output.csv', 'wb') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    csv_writer.writerow(['Ligand'] + receptors)  # header row
    for ligand in (line.rstrip() for line in experiment):
        row = [ligand]  # first column is the ligand itself
        for protein in receptors:
            with open(protein + '.txt', "rt") as file1:
                # index 1 ('Not Found') is picked when find() returns -1
                found = ['Found', 'Not Found'][file1.read().find(ligand) == -1]
            row.append(found)
        csv_writer.writerow(row)
print('output.csv file written')
Update
As I said in a comment, this could be done a lot faster by reading each protein file only once. To do that and still format the output the way you want, the result of checking each ligand against each file needs to be stored in a data structure that is built up incrementally as each file is read, and written out all at once only after every file has been processed. A simple list of lists is adequate for this purpose and is what the implementation below uses.
The trade-off is using more memory versus rereading the protein files over and over. Since disk I/O is often one of the slowest operations on a computer, the potentially large performance gain is probably worth the slight increase in code complexity.
Here's the code showing this alternative version:
import csv

receptors = ['crystal_1', 'modeller_1', 'moe_1',
             'nci5_modeller0000_1', 'nci5_modeller0001_1',
             'nci5_modeller0002_1', 'nci5_modeller0003_1',
             'nci5_modeller0004_1', 'nci5_modeller0005_1',
             'nci5_modeller0006_1', 'nci5_modeller0007_1',
             'nci5_modeller0008_1', 'nci5_modeller0009_1',
             'nci5_modeller0010_1', 'nci5_modeller0011_1',
             'nci5_moe0000_1', 'nci5_moe0001_1', 'nci5_moe0002_1',
             'nci5_moe0003_1', 'nci5_moe0004_1', 'nci5_moe0005_1',
             'nci5_moe0006_1', 'nci5_moe0007_1', 'nci5_moe0008_1',
             'nci5_moe0009_1', 'nci5_moe0010_1', 'nci5_moe0011_1',
             'nci5_moe0012_1', 'nci5_moe0013_1', 'nci5_moe0014_1']

# initialize list of lists holding each ligand and its presence in each receptor
with open('potentiation.txt') as experiment:
    ligands = [[ligand] for ligand in (line.rstrip() for line in experiment)]

for protein in receptors:
    # each protein file is read exactly once
    with open(protein + '.txt') as protein_file:
        protein_file_data = protein_file.read()
    for row in ligands:
        # determine if this ligand (row[0]) appears in protein data
        row.append('Found' if row[0] in protein_file_data else 'Not Found')

# 'wb' is the mode the csv module expects on Python 2; on Python 3
# use open('output.csv', 'w', newline='') instead.
with open('output.csv', 'wb') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    csv_writer.writerow(['Ligand'] + receptors)  # header row
    csv_writer.writerows(ligands)
print('output.csv file written')
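If you are on Python 3, the 'wb' mode used above will not work with the csv module, because the writer expects a text-mode file there. Here is a rough sketch of the same read-each-file-once idea, split into two small functions so it can be tried without any files on disk. The function names build_rows and write_table and the sample data are made up for illustration; they are not part of the code above.

```python
import csv
import io


def build_rows(ligands, receptor_texts):
    """Return one row per ligand: [ligand, status for each receptor].

    receptor_texts is a list of (name, text) pairs, where text stands in
    for the full contents of that receptor's file (read only once).
    """
    rows = [[ligand] for ligand in ligands]
    for _name, text in receptor_texts:
        for row in rows:
            # row[0] is the ligand; append its status for this receptor
            row.append('Found' if row[0] in text else 'Not Found')
    return rows


def write_table(outfile, receptor_texts, rows):
    """Write a header row plus all result rows as tab-delimited text."""
    writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
    writer.writerow(['Ligand'] + [name for name, _text in receptor_texts])
    writer.writerows(rows)


# Made-up sample data, for illustration only.
sample_texts = [
    ('nci5_moe0013_1', 'ligand 111111 score -7.2'),
    ('nci5_moe0014_1', 'ligand 345647 score -8.1'),
]
sample_rows = build_rows(['345647'], sample_texts)

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_table(buffer, sample_texts, sample_rows)
table_text = buffer.getvalue()
```

To write a real file on Python 3, pass open('output.csv', 'w', newline='') to write_table in place of the in-memory buffer.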
Upvotes: 3
Reputation: 1017
You can collect the results in lists: one list for the header (the protein names) and one list per ligand for its results. Put the column label at index 0 of the header list and the ligand's value at index 0 of its result list. After that it's easy to save them to a text file.
To save them, open a file for writing and turn each list into a string:
my_string = " ".join(map(str, lst))
Then write my_string to the file, and do the same for each list.
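A small sketch of that idea, with made-up sample values. The helper name format_row is hypothetical, and it joins on a tab rather than a space because "Not Found" itself contains a space, which would make a space-separated line ambiguous.

```python
def format_row(values, sep="\t"):
    """Join one row of values into a single delimited line."""
    return sep.join(map(str, values))


# Made-up sample data: a header list and one result list for a ligand.
header = ["Ligand", "nci5_moe0013_1", "nci5_moe0014_1"]
row = [345647, "Not Found", "Found"]

lines = [format_row(header), format_row(row)]
output_text = "\n".join(lines) + "\n"
```

You would then write output_text to a file opened for writing. Note that a plain join does no quoting, so if a value could ever contain the separator, the csv module is the safer choice.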
Upvotes: 0