Reputation: 131
I just started learning Python and programming, so this is probably a pretty naive question, but I'd appreciate any help.
The following code works, but I've been told that having all these intermediate inputs and outputs is bad and that I should nest the loops instead. But try as I might, every time I try to nest anything it just ends up giving me an empty folder.
So my question is: how do I nest all of these?
Thanks, and sorry for the long post.
# 1) I call a Perl script and execute it to get the input file.
import subprocess
import sys
perl = "/usr/bin/perl"
perl_script = "geoFF.pl"
params = "--mount-doom-hot"  # no leading space: each list item is passed to the script verbatim
pl_script = subprocess.Popen([perl, perl_script, params], stdout=sys.stdout)
pl_script.communicate()
# 2) Read the output of the Perl script, but keep only the wanted data.
# The input is a BIG file and I just want some specific lines from it.
infile1 = "inputperl.txt"
outfile1 = "c1.txt"
f1 = open(infile1, 'r')
o1 = open(outfile1, 'w+')
words = ['Acc', 'title', 'orgn', 'date', 'GP']  # keep the lines of f1 that contain any of these words
for line in f1:
    if any(word in line for word in words):
        o1.write(line)
f1.close()
o1.close()  # flush c1.txt to disk before it is read again below
# From the selected lines, delete some symbols/words I don't want.
input1 = open("c1.txt", 'r')
output1 = open("c2.txt", 'w')
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']  # I want to keep the rest of the line but not these words.
for line in input1:
    for word in del_list:
        line = line.replace(word, "")
    output1.write(line)
input1.close()
output1.close()  # flush c2.txt to disk before it is read again below
# One specific word in the lines is AB. The file has lines with AB129, AB8877, AB0997 and so on.
# Here I want to prepend a URL to each AB so that it becomes a hyperlink.
inp = open("c2.txt", 'r')
out = open("c3.txt", 'w')
filedata2 = inp.read()
newdata2 = filedata2.replace('AB', '\nhttp://www.whatever.com/g/qu/acc.cgi?acc=AB')
out.write(newdata2)
inp.close()
out.close()
# This outputs each line as http://www.whatever.com/g/qu/acc.cgi?acc=AB(somenumber),
# for example http://www.whatever.com/g/qu/acc.cgi?acc=AB129
# and http://www.whatever.com/g/qu/acc.cgi?acc=AB8877, etc.
# Then I want to take this file with the changes and send it by email.
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
fromaddr = "[email protected]"
toaddr = "[email protected]"
msg = MIMEMultipart()
msg['From'] = fromaddr
msg['To'] = toaddr
msg['Subject'] = "RESULT"
# Send the text file in the email body.
f6 = open("c3.txt", 'r')
results = MIMEText(f6.read(), 'plain')
f6.close()
msg.attach(results)
# Convert the message to a string and send it.
import smtplib
server = smtplib.SMTP('smtp.gmail.com', 587)
server.ehlo()
server.starttls()
server.ehlo()
server.login("sender email", "password")
text = msg.as_string()
server.sendmail(fromaddr, toaddr, text)
server.quit()
The input file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE>
<eSummaryResult>
<DS>
<Id>20006767</Id>
<Item Name="Acc" Type="String">AB64767</Item>
<Item Name="GDS" Type="String"></Item>
<Item Name="title" Type="String">word word title of this word...</Item>
<Item Name="summary" Type="String">word word word..word word word..</Item>
<Item Name="GP" Type="String">11002;13112</Item>
<Item Name="AB" Type="String">64767</Item>
<Item Name="orgn" Type="String">Mus musculus</Item>
<Item Name="entryType" Type="String">AB</Item>
<Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item>
<Item Name="ptechType" Type="String"></Item>
<Item Name="valType" Type="String"></Item>
<Item Name="SSInfo" Type="String"></Item>
<Item Name="subsetInfo" Type="String"></Item>
<Item Name="date" Type="String">2015/12/09</Item>
<Item Name="suppFile" Type="String">WIG</Item>
<Item Name="Samples" Type="List">
</Item>
<Item Name="n_samples" Type="Integer">12</Item>
<Item Name="SeriesTitle" Type="String"></Item>
<Item Name="PlatformTitle" Type="String"></Item>
<Item Name="PlatformTaxa" Type="String"></Item>
<Item Name="SamplesTaxa" Type="String"></Item>
<Item Name="Ids" Type="List">
</Item>
<Id>200098567</Id>
<Item Name="Acc" Type="String">AB64789</Item>
<Item Name="GDS" Type="String"></Item>
<Item Name="title" Type="String">word word word...</Item>
<Item Name="summary" Type="String">word word word..word word word..</Item>
<Item Name="GP" Type="String">11002;13112</Item>
<Item Name="AB" Type="String">AB64789</Item>
<Item Name="orgn" Type="String">Mus musculus</Item>
<Item Name="entryType" Type="String">AB</Item>
<Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item>
<Item Name="ptechType" Type="String"></Item>
<Item Name="valType" Type="String"></Item>
<Item Name="SSInfo" Type="String"></Item>
<Item Name="subsetInfo" Type="String"></Item>
<Item Name="date" Type="String">2015/12/09</Item>
<Item Name="suppFile" Type="String">WIG</Item>
<Item Name="Samples" Type="List">
</Item>
</Item>
<Id>200064997</Id>
<Item Name="Acc" Type="String">AB69957</Item>
<Item Name="GDS" Type="String"></Item>
<Item Name="title" Type="String">word word word...</Item>
<Item Name="summary" Type="String">word word word..word word word..</Item>
<Item Name="GP" Type="String">1100</Item>
<Item Name="AB" Type="String">69957</Item>
<Item Name="orgn" Type="String">Mus musculus</Item>
<Item Name="entryType" Type="String">AB</Item>
<Item Name="gdsType" Type="String">word word word..word word word..word word word..</Item>
<Item Name="ptechType" Type="String"></Item>
<Item Name="valType" Type="String"></Item>
<Item Name="SSInfo" Type="String"></Item>
<Item Name="subsetInfo" Type="String"></Item>
<Item Name="date" Type="String">2015/12/09</Item>
<Item Name="suppFile" Type="String">WIG</Item>
<Item Name="Samples" Type="List">
</Item>
<Item Name="n_samples" Type="Integer">12</Item>
<Item Name="SeriesTitle" Type="String"></Item>
<Item Name="PlatformTitle" Type="String"></Item>
<Item Name="PlatformTaxa" Type="String"></Item>
<Item Name="SamplesTaxa" Type="String"></Item>
<Item Name="Ids" Type="List">
<Item Name="int" Type="Integer">26476451</Item>
</Item>
<Item Name="Projects" Type="List"></Item>
<Item Name="G2R" Type="String">no</Item>
I just want the following data:
<Item Name="Acc" Type="String">AB64767</Item>
<Item Name="title" Type="String">word word title of this word...</Item>
<Item Name="AB" Type="String">64767</Item>
<Item Name="orgn" Type="String">Mus musculus</Item>
<Item Name="date" Type="String">2015/12/09</Item>
But displayed as:
http://www.whatever.com/g/qu/acc.cgi?acc=AB64767
word word title of this word...
Mus musculus
2015/12/09
http://www.whatever.com/g/qu/acc.cgi?acc=AB64789
word word title of this word...
Mus musculus
2015/12/09
http://www.whatever.com/g/qu/acc.cgi?acc=AB69957
word word title of this word...
Mus musculus
2015/12/09
Upvotes: 1
Views: 91
Reputation: 180481
Reading the file once and using a regex would be a better approach:
import re
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS',
            'Acc']  # I want to keep the rest of the line but not these words.
words = ['Acc', 'title', 'orgn', 'date', 'GP']
rep = re.compile(r'|'.join(del_list))
keep = re.compile(r"|".join(words))
r3 = re.compile(r"AB(?=\d)")
with open("test.txt") as f, open("out.txt", "w") as out:
    for line in f:
        # if the line contains a match from words
        if keep.search(line):
            # remove all unwanted substrings
            line = rep.sub("", line.lstrip())
            line = r3.sub('\nhttp://www.whatever.com/g/qu/acc.cgi?acc=AB', line)
            out.write(line)
out.txt:
Item Type=String
http://www.whatever.com/g/qu/acc.cgi?acc=AB64767
Item Type=Stringword word of this word...
Item Type=String11002;13112
Item Type=StringMus musculus
Item Type=String2015/12/09
Item Type=String
http://www.whatever.com/g/qu/acc.cgi?acc=AB64789
Item Type=Stringword word word...
Item Type=String11002;13112
Item Type=StringMus musculus
Item Type=String2015/12/09
Item Type=String
http://www.whatever.com/g/qu/acc.cgi?acc=AB69957
Item Type=Stringword word word...
Item Type=String1100
Item Type=StringMus musculus
Item Type=String2015/12/09
If you want to match some words exactly, you will need to use word boundaries in the regexes, or you will end up matching "foo" in "foobar". Also, if all you want to do is send the file, you don't have to write it to disk at all.
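To illustrate the word-boundary point, here is a minimal sketch. The word list and the sample string are made up for the demonstration and are not taken from the question's data; `re.escape` is added as a precaution in case a word ever contains a regex metacharacter.

```python
import re

# Hypothetical word list and sample text, chosen only for illustration.
words = ["foo", "date"]

# Without boundaries, each word also matches inside longer words.
loose = re.compile("|".join(re.escape(w) for w in words))

# \b requires a word boundary on both sides, so only whole words match.
strict = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")

sample = "foobar update date"

loose_hits = loose.findall(sample)    # also hits "foo" in "foobar" and "date" in "update"
strict_hits = strict.findall(sample)  # only the standalone word "date"

print(loose_hits)
print(strict_hits)
```

Whether boundaries are appropriate depends on the data: in the question's file the words sit inside attribute values such as Name="Acc", where the surrounding quotes already act as boundaries.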
Upvotes: 1
Reputation: 4985
While this is nowhere near complete, here are some pointers:
Disk I/O is slow, so if you read the input once, do all your processing, and only then generate your output, you get better performance than going through a separate file for each filtering step.
For example, let's examine this:
for line in f1:
    if any(word in line for word in words):
        o1.write(line)
# From the selected lines, delete some symbols/words I don't want.
input1 = open("c1.txt", 'r')
output1 = open("c2.txt", 'w')
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']  # I want to keep the rest of the line but not these words.
for line in input1:
    for word in del_list:
        line = line.replace(word, "")
    output1.write(line)
In the first loop you select only a few lines from your input file. In the second loop you delete some words from the selected lines. In between, you write all of the selected data to disk and read it back again.
A fairly simple optimization is to do the word replacing directly before writing back to disk, i.e.:
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"', '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']
for line in f1:
    if any(word in line for word in words):
        for word in del_list:
            line = line.replace(word, "")
        o1.write(line)
Can you see how this saves a round trip to disk? An alternative technique is to hold the data in memory by reading the file into a list and then operating on that list, rather than going back and forth to disk every time.
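The list-based alternative could be sketched roughly as follows. The three sample lines stand in for the file contents (in a real script they would come from something like f1.readlines()), and the helper name strip_words is made up for this example.

```python
# Sample lines standing in for the input file (an assumption for this demo).
lines = [
    '<Item Name="Acc" Type="String">AB64767</Item>\n',
    '<Item Name="summary" Type="String">word word word..</Item>\n',
    '<Item Name="orgn" Type="String">Mus musculus</Item>\n',
]

words = ['Acc', 'title', 'orgn', 'date', 'GP']
del_list = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"',
            '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']


def strip_words(line, unwanted):
    """Return line with every substring in unwanted removed (hypothetical helper)."""
    for word in unwanted:
        line = line.replace(word, "")
    return line


# Step 1: keep only the lines that contain one of the wanted words.
selected = [line for line in lines if any(word in line for word in words)]

# Step 2: clean the selected lines, still entirely in memory.
cleaned = [strip_words(line, del_list) for line in selected]

print("".join(cleaned))
```

Each step produces a new list instead of a new file, so the intermediate c1.txt and c2.txt are no longer needed. For a very large input, the line-by-line loop shown earlier uses less memory than holding everything in lists.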
I hope this points you the right way. Surely you can now figure out how to get rid of the third set of files, so that you end up with only one input file and one output file.
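For reference, one possible shape of that final version is sketched below. It is only an illustration under a few assumptions: the function name process and the constant names are invented here, and the demo writes a tiny sample input into a temporary directory instead of using the real output of the Perl script.

```python
import os
import tempfile

WORDS = ['Acc', 'title', 'orgn', 'date', 'GP']
DEL_LIST = ['>', 'title', 'orgn', 'date', '<', 'GP', '/Item', '"',
            '</Item>', '<DS>', 'Name=', 'DocS', 'Acc']
URL_PREFIX = 'http://www.whatever.com/g/qu/acc.cgi?acc=AB'


def process(in_path, out_path):
    """Filter, clean and link lines in one pass; return the number of lines written."""
    written = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            # Step 1: skip lines without any of the wanted words.
            if not any(word in line for word in WORDS):
                continue
            # Step 2: delete the unwanted substrings.
            for word in DEL_LIST:
                line = line.replace(word, "")
            # Step 3: put the URL in front of every AB, as the original script does.
            line = line.replace('AB', '\n' + URL_PREFIX)
            dst.write(line)
            written += 1
    return written


# Demo with a throwaway directory and made-up sample input.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, 'inputperl.txt')
out_path = os.path.join(tmp_dir, 'c3.txt')
with open(in_path, 'w') as f:
    f.write('<Item Name="Acc" Type="String">AB64767</Item>\n')
    f.write('<Item Name="summary" Type="String">word word</Item>\n')
    f.write('<Item Name="date" Type="String">2015/12/09</Item>\n')

count = process(in_path, out_path)
with open(out_path) as f:
    result = f.read()
print(result)
```

Like the original, this replaces every occurrence of AB, including any that happen to appear inside a title, so a stricter pattern may be needed for real data.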
Upvotes: 1