Reputation: 862
I'm working with files with this format:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
As you can see, every SPEC line is different, except for two cases where the spectrum number is repeated. What I'd like to do is take every chunk of lines between the =Cluster= markers and check whether any lines repeat a spectrum value. If several lines share the same value, all of them except one should be removed.
The output file should be like this:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
I was using groupby from the itertools module. My input file is called f_input.txt and the output file is called new_file.txt, but this script removes the word SPEC as well, and I don't know what to change to prevent that.
from itertools import groupby
data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt", 'r'))
final = list(k for k,_ in groupby(list(data)))
with open("new_file.txt", 'a') as f:
    for k in final:
        if k == ['','']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n\n")
        else:
            f.write("{}\n".join(k))
EDIT: New condition. Sometimes part of the line can differ while the spectrum number stays the same, for example:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
As you can see, the PRD number in the last line has changed. One solution would be to check the spectrum number and remove lines based on a repeated spectrum.
The result would then be:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
Upvotes: 0
Views: 141
Reputation: 2830
Shortest solution in Python :p
import os
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")
output:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
(if you are on Windows, awk can be installed easily with Gow.)
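For anyone who would rather not shell out, here is a rough pure-Python sketch of what that awk one-liner does. Note that, like the awk version, it only collapses lines that are identical to the line directly before them, so it would not catch the PRD000682 case from the question's EDIT. The function names are made up for illustration.

```python
def drop_adjacent_duplicates(lines):
    """Yield each line unless it is identical to the line just before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def dedup_file(src_path, dst_path):
    """Stream src_path into dst_path, skipping adjacent identical lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_adjacent_duplicates(src))
```

Calling dedup_file("originalfile.txt", "dedup.txt") would then mirror the command above, reading one line at a time instead of loading the whole file.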
Upvotes: 1
Reputation: 15204
This is how I would do it.
file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            seen_spectra = set()
            f_out.write(line)
        else:
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)
This is not a groupby solution, but one that you can easily follow and debug if you have to. As you mentioned in the comments, your file is 16 GB, so loading it into memory is probably not the best idea.
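If you do want to stay with groupby, something along these lines might work. It is only a sketch: it assumes every cluster starts with a line beginning with =Cluster= and that the spectrum number is whatever follows the last = on a SPEC line. The names dedup_clusters and spectrum_of are mine, not from the question. It still streams the input, so memory use is bounded by the size of one cluster's set of spectra.

```python
from itertools import groupby


def spectrum_of(line):
    """Return the number after the last '=' (e.g. '3785')."""
    return line.rstrip().split('=')[-1].split()[0]


def dedup_clusters(lines):
    """Yield lines, keeping only the first SPEC line per spectrum in each cluster."""
    cluster_id = 0

    def cluster_key(line):
        # Bump the counter on every header so groupby starts a new group there.
        nonlocal cluster_id
        if line.startswith('=Cluster='):
            cluster_id += 1
        return cluster_id

    for _, group in groupby(lines, key=cluster_key):
        seen = set()  # fresh set for every cluster
        for line in group:
            if not line.startswith('SPEC'):
                yield line  # headers and blank lines pass through
                continue
            spectrum = spectrum_of(line)
            if spectrum not in seen:
                seen.add(spectrum)
                yield line
```

To use it on files, pass the open input file as lines and hand the result to f_out.writelines(...).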
EDIT: Following the comment "Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another", the set no longer needs to be reset for each cluster:
file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            # Skip SPEC lines whose spectrum number was already written
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            f_out.write(line)
Upvotes: 3
Reputation: 92874
A solution using the re.search() function and a custom spectrums set object for keeping only unique spectrum numbers:
import re

with open('f_input.txt') as oldfile, open('new_file.txt', 'w') as newfile:
    spectrums = set()  # spectrum numbers already written
    for line in oldfile:
        if '=Cluster=' in line or not line.strip():
            newfile.write(line)
        else:
            m = re.search(r'spectrum=(\d+)', line)
            spectrum = m.group(1)
            if spectrum not in spectrums:
                spectrums.add(spectrum)
                newfile.write(line)
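To see why matching on the spectrum number also covers the case from the question's EDIT, here is a small sketch of the extraction step on its own. The helper name spectrum_id is hypothetical, and it returns None for lines without a spectrum field rather than raising.

```python
import re

SPECTRUM_RE = re.compile(r'spectrum=(\d+)')


def spectrum_id(line):
    """Return the digits after 'spectrum=' as a string, or None if absent."""
    m = SPECTRUM_RE.search(line)
    return m.group(1) if m else None


a = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true"
b = "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true"
# Both lines map to the same key even though the PRD number differs.
same_key = spectrum_id(a) == spectrum_id(b)
```

Because the key ignores everything except the spectrum number, the two lines above count as duplicates of each other.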
Upvotes: 0
Reputation: 3848
This opens the file containing your original data as well as a new file that receives the unique lines per group.
seen is a set, which is great for checking whether something already exists within it.
data is a list that keeps track of the "=Cluster=" groups.
Then you simply review each line of each group (designated as i within data).
If the line does not exist within seen, it is added.
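with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # Reads the whole file at once and splits it into "=Cluster=" groups
    data = in_file.read().split("=Cluster=\n")
    out_file.write(data[0])  # anything before the first marker (usually empty)
    for i in data[1:]:
        out_file.write("=Cluster=\n")
        seen = set()  # reset for every group
        for line in i.splitlines():
            if line in seen:
                continue
            seen.add(line)
            out_file.write(line + "\n")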
EDIT: Moved seen=set() to within the for i in data loop to reset the set each time; otherwise "=Cluster=" would always exist in it and would not print for each group within data.
Upvotes: 2