Reputation: 862

Is there any way to remove duplicate lines within a block of text based on a pattern?

I'm working with files with this format:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true


=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

As you can see, every SPEC line is different, except for a few where the spectrum number is repeated. What I'd like to do is take every chunk of lines between the =Cluster= markers and check whether any lines repeat a spectrum value. If several lines repeat, all of them except one should be removed.

The output file should be like this:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true


=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

I was using groupby from the itertools module. I assume my input file is called f_input.txt and the output file is called new_file.txt, but this script removes the word SPEC as well, and I don't know what to change to avoid that.

from itertools import groupby

data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt", 'r'))
final = list(k for k,_ in groupby(list(data)))

with open("new_file.txt", 'a') as f:
    for k in final:
        if k == ['','']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n\n")
        else:
            f.write("{}\n".join(k))

EDIT: New condition. Sometimes another part of the line can change, for example:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

As you can see, the PRD number has changed in the last line. One solution would be to check the spectrum number and remove lines based on a repeated spectrum.

The result would be:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

Upvotes: 0

Answers (4)

Ettore Rizza

Reputation: 2830

Shortest solution in Python :p

import os
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")

output:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

(if you are on Windows, awk can be installed easily with Gow.)
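For readers without awk, the same adjacent-duplicate filter can be sketched in pure Python. The helper name drop_adjacent_duplicates is made up for this illustration. Like the awk command, it compares whole lines, so it would not catch the case from the question's EDIT where only the PRD part differs.

```python
from itertools import groupby

def drop_adjacent_duplicates(lines):
    """Yield one line per run of identical consecutive lines."""
    # With no key function, groupby groups consecutive equal items together.
    for line, _run in groupby(lines):
        yield line

sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
]
deduped = list(drop_adjacent_duplicates(sample))
```

Because the helper accepts any iterable of lines, an open file object can be passed in directly and the result written with writelines, which keeps memory use low for large files.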

Upvotes: 1

Ma0

Reputation: 15204

This is how I would do it.

file_in = r'someFile.txt'   
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            seen_spectra = set()
            f_out.write(line)
        else:
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)

This is not a groupby solution, but it is one you can easily follow and debug if you have to. As you mentioned in the comments, your file is 16 GB, so loading it into memory is probably not the best idea.
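To make the key extraction in the loop above easier to follow, here is a small worked example on two of the sample lines from the question. The variable names first, second and spectrum_key are only for illustration.

```python
def spectrum_key(line):
    # Same expression as in the loop: take the text after the last '=',
    # then keep the first whitespace-separated token.
    return line.rstrip().split('=')[-1].split()[0]

first = "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"
second = "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"

# Both lines reduce to the same key, so the second would be skipped
# even though its PRD number differs.
same_key = spectrum_key(first) == spectrum_key(second)
```

This relies on the spectrum value being the last '=' field on the line, which holds for the format shown in the question.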

EDIT: Following your comment ("Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"), the set no longer needs to be reset for each cluster:

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            f_out.write(line)

Upvotes: 3

RomanPerekhrest

Reputation: 92874

A solution using the re.search() function and a spectrums set that keeps track of the spectrum numbers already written:

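import re

with open('f_input.txt') as oldfile, open('new_file.txt', 'w') as newfile:
    spectrums = set()  # spectrum numbers already written
    for line in oldfile:
        # cluster headers and blank lines are copied unchanged
        if '=Cluster=' in line or not line.strip():
            newfile.write(line)
        else:
            m = re.search(r'spectrum=(\d+)', line)
            spectrum = m.group(1)
            # write a SPEC line only the first time its spectrum appears
            if spectrum not in spectrums:
                spectrums.add(spectrum)
                newfile.write(line)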
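If the same spectrum number could legitimately appear in more than one cluster, the set would need to be reset at each =Cluster= header. The sketch below is one way to do that, written as a generator so it can be tried on a list of lines before pointing it at a file. The name dedupe_spectra is hypothetical, and the guard for lines without a spectrum field is an added assumption, not part of the code above.

```python
import re

SPECTRUM_RE = re.compile(r'spectrum=(\d+)')

def dedupe_spectra(lines):
    """Yield lines, dropping repeated spectrum numbers within each cluster."""
    seen = set()
    for line in lines:
        if line.startswith('=Cluster='):
            seen = set()          # new cluster: forget earlier spectra
            yield line
            continue
        m = SPECTRUM_RE.search(line)
        if m is None:
            yield line            # blank or unrecognised line: pass through
        elif m.group(1) not in seen:
            seen.add(m.group(1))
            yield line

sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
    "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n",
]
result = list(dedupe_spectra(sample))
```

With a real file this could be used as newfile.writelines(dedupe_spectra(oldfile)), which still streams the input line by line.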

Upvotes: 0

pstatix

Reputation: 3848

This will open the file containing your original data as well as a new file that will receive the unique lines per group.

seen is a set and is great for seeing if something exists within it already.

data is a list and will keep track of the iterations of "=Cluster=" groups.

Then you simply review each line of each of the groups (designated as i within data).

If the line does not exist within seen it is added.

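with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # split the whole file into "=Cluster=" groups (this reads it all into memory)
    data = in_file.read().split("=Cluster=")
    for n, i in enumerate(data):
        if n > 0:
            out_file.write("=Cluster=")  # put back the marker removed by split()
        seen = set()
        for line in i.splitlines(keepends=True):
            key = line.strip()
            if key and key in seen:  # blank lines are never treated as duplicates
                continue
            seen.add(key)
            out_file.write(line)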

EDIT: Moved seen=set() to within the for i in data to reset the set each time otherwise "=Cluster=" would always exist and would not print for each group within data.
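Since the question started from itertools.groupby, here is a hedged sketch of how groupby could split the lines into =Cluster= groups lazily, so only one cluster is held in memory at a time. The names iter_clusters and unique_lines are invented for this example, and duplicates are detected by comparing whole lines, as in the code above.

```python
from itertools import groupby

def iter_clusters(lines):
    """Yield one list of lines per "=Cluster=" group."""
    count = 0

    def cluster_index(line):
        # Bump the counter on every header so that all lines up to the
        # next header share the same groupby key.
        nonlocal count
        if line.startswith("=Cluster="):
            count += 1
        return count

    for _index, chunk in groupby(lines, key=cluster_index):
        yield list(chunk)

def unique_lines(chunk):
    """Keep the first copy of each non-blank line in a cluster."""
    seen = set()
    kept = []
    for line in chunk:
        key = line.strip()
        if key and key in seen:
            continue
        seen.add(key)
        kept.append(line)
    return kept

sample = [
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n",
    "\n",
    "=Cluster=\n",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true\n",
]
clusters = [unique_lines(c) for c in iter_clusters(sample)]
```

The stateful key function works because groupby evaluates the key once per item, in order. Swapping the whole-line key for the spectrum number would be needed to cover the case from the question's EDIT.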

Upvotes: 2
