Reputation: 862
I've tried to find a good way to do this, but unfortunately I didn't find one.
I'm working with files with this format:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
As you can see, every SPEC line is different except the last two, which share the same spectrum number.
What I'd like to do is take every chunk of information between =Cluster= markers and check whether any lines have a repeated spectrum value. If several lines are repeated, all but one should be removed.
The output file should be like this:
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true
=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
I'm using this to split the file on the pattern, but I don't know how to check for repeated spectrum values.
#!/usr/bin/perl
undef $/;
$_ = <>;
$n = 0;
for $match (split(/(?==Cluster=)/)) {
    open(O, '>temp' . ++$n);
    print O $match;
    close(O);
}
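Since the question mentions Python as well, the same split can be sketched there (a hypothetical equivalent, not from the original post; `re.split` on a zero-width lookahead requires Python 3.7+):

```python
import re

# Split at the position just before each "=Cluster=", keeping the marker
# with its chunk, like Perl's split(/(?==Cluster=)/).
text = "=Cluster=\nA\n=Cluster=\nB\n"
chunks = [c for c in re.split(r"(?==Cluster=)", text) if c]
```

Each element of `chunks` could then be written to its own temp file, as in the Perl loop above.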
PS: I used Perl because it's easier for me, but I understand Python as well.
Upvotes: 0
Views: 101
Reputation: 8696
The task seems easy enough to not require Perl/Python: use the uniq command to remove adjacent duplicate lines:
$ uniq < input.txt > output.txt
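Note that this works for the sample data because the repeated lines sit next to each other; uniq only collapses adjacent duplicates. A minimal Python sketch of the same behaviour (hypothetical, not part of the answer; `uniq_lines` is an illustrative name):

```python
def uniq_lines(lines):
    """Drop each line that is identical to the line directly before it."""
    out = []
    prev = None
    for line in lines:
        if line != prev:
            out.append(line)
        prev = line
    return out
```

Non-adjacent repeats would survive, just as with the uniq command.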
Upvotes: 0
Reputation: 9257
You can also use this Python script, which uses groupby from the itertools module. I assume your input file is called f_input.txt and the output file is called new_file.txt.
from itertools import groupby

data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt", 'r'))
final = list(k for k, _ in groupby(list(data)))
with open("new_file.txt", 'w') as f:
    for k in final:
        if k == ['', '']:
            f.write("=Cluster=\n")
        elif k == ['']:
            # write '\n\n' in Windows and '\n' in Linux (tested only in Windows!)
            f.write("\n\n")
        else:
            f.write(k[0] + "\n")  # a SPEC line splits into a single-element list
The output file new_file.txt will be similar to your desired output.
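As a side note, groupby merges only *consecutive* equal items, which is why this behaves like uniq. A minimal illustration (example values assumed, not from the answer):

```python
from itertools import groupby

# groupby collapses runs of equal items; non-adjacent repeats survive.
items = ["473", "473", "1074", "473"]
collapsed = [key for key, _ in groupby(items)]
```

So duplicates that are not next to each other would still need a set-based check.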
Upvotes: 1
Reputation: 53478
Something like this will remove duplicate lines (globally across the file).
#!/usr/bin/perl
use warnings;
use strict;

my %seen;
while ( <> ) {
    next if ( m/SPEC/ and $seen{$_}++ );
    print;
}
If you want to be more specific about the spectrum value, for example:
next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
As you're splitting out your clusters, you can do something quite similar, but just:
if ( $line =~ m/=Cluster=/ ) {
    open ( $output, ">", "temp" . $count++ ) or die $!;
    select $output;
}
This sets the default 'print' location to $output (you'll need to declare it outside your loop too).
You should also:
- use strict; and use warnings;
- avoid slurping the whole of <> into $_; it's unnecessary. If you had to, it'd generally be better to use $block = do { local $/; <> }; instead, and then match with $block =~ m/regex/;
- use a lexical filehandle and check that open succeeded: open ( my $output, '>', 'filename' ) or die $!; (or die $! is usually sufficient).
So that would be something like:
#!/usr/bin/perl
use warnings;
use strict;

my %seen;
my $count = 0;
my $output;
while ( <> ) {
    next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
    if ( m/=Cluster=/ ) {
        open ( $output, ">", "temp" . $count++ ) or die $!;
        select $output;
    }
    print;
}
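If you want the %seen check scoped to each cluster rather than the whole file, the same idea in Python would be to reset the set at every marker. A sketch under that assumption (dedup_clusters is a hypothetical helper, not from the answer):

```python
import re

def dedup_clusters(lines):
    """Keep only the first line per spectrum number within each =Cluster= block."""
    seen = set()
    out = []
    for line in lines:
        if "=Cluster=" in line:
            seen = set()  # new cluster: forget spectra seen in the previous one
            out.append(line)
            continue
        m = re.search(r"spectrum=(\d+)", line)
        if m and m.group(1) in seen:
            continue  # duplicate spectrum inside this cluster, skip it
        if m:
            seen.add(m.group(1))
        out.append(line)
    return out
```

With a global `seen` (never reset) this degrades to the behaviour of the Perl script above.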
Upvotes: 1
Reputation: 91385
If the duplicate lines are consecutive, you could use this Perl one-liner:
perl -ani.back -e 'next if defined($p) && $_ eq $p;$p=$_;print' file.txt
The original file is backed up with the extension .back.
Upvotes: 0