Enrique

Reputation: 862

Remove repeated lines in a file based on a pattern

I've tried to find a good way to do this, but unfortunately I haven't found one.

I'm working with files with this format:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

As you can see, every SPEC line is different except in the last cluster, where the spectrum number is repeated. What I'd like to do is take every chunk of information between =Cluster= markers and check whether any lines have a repeated spectrum value. If several lines are repeated, remove all but one.

The output file should be like this:

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

=Cluster=
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

I'm using this to split the file on the pattern, but I don't know how to check for repeated spectrum values.

#!/usr/bin/perl

undef $/;
$_ = <>;
$n = 0;

for $match (split(/(?==Cluster=)/)) {
      open(O, '>temp' . ++$n);
      print O $match;
      close(O);
}

PS: I used Perl because it's easier for me, but I understand Python as well.

Upvotes: 0

Views: 101

Answers (4)

dolmen

Reputation: 8696

The task seems simple enough not to require Perl/Python: use the uniq command to remove adjacent duplicate lines:

$ uniq < input.txt > output.txt
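If you'd rather stay in a script, the same adjacent-duplicate filtering that uniq performs can be sketched in Python with itertools.groupby (the function name is just an illustration):

```python
from itertools import groupby

def drop_adjacent_duplicates(lines):
    """Keep only the first line of each run of identical consecutive
    lines, mimicking what the uniq command does."""
    return [line for line, _ in groupby(lines)]

sample = [
    "=Cluster=",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true",
]
print(drop_adjacent_duplicates(sample))
```

Like uniq, this only collapses duplicates that are next to each other, which is enough for the example input shown in the question.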

Upvotes: 0

Chiheb Nexus

Reputation: 9257

You can also use this Python script, which uses groupby from the itertools module.

I assume your input file is called f_input.txt and the output file is called new_file.txt.

from itertools import groupby

# Each "=Cluster=" line splits to ['', ''], each blank line to [''],
# and every other line to a one-element list.
data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt"))
# groupby collapses runs of identical items, so a repeated line survives once.
final = [k for k, _ in groupby(data)]

with open("new_file.txt", 'w') as f:
    for k in final:
        if k == ['', '']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n")
        else:
            f.write(k[0] + "\n")

The output file new_file.txt should match your desired output.

Upvotes: 1

Sobrique

Reputation: 53478

Something like this will remove duplicate lines (globally across the file).

#!/usr/bin/perl

use warnings;
use strict;

my %seen; 

while ( <> ) {
  next if ( m/SPEC/ and $seen{$_}++ );
  print;
}

If you want to be more specific about the spectrum value, for example:

next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
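The same idea, keying on the spectrum number rather than the whole line, can be sketched in Python (the regex mirrors the Perl above; the function name is just an illustration):

```python
import re

def dedupe_by_spectrum(lines):
    """Drop any line whose spectrum number has been seen before;
    lines without a spectrum= field pass through untouched."""
    seen = set()
    out = []
    for line in lines:
        m = re.search(r"spectrum=(\d+)", line)
        if m:
            if m.group(1) in seen:
                continue  # this spectrum number already appeared
            seen.add(m.group(1))
        out.append(line)
    return out
```

Note that, like the Perl snippet, this de-duplicates globally across the whole input, not per cluster.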

As you're splitting out your clusters, you can do something quite similar, but just:

  if ( $line =~ m/=Cluster=/ ) { 
     open ( $output, ">", "temp".$count++ ); 
     select $output;
  }

This sets the default print location to $output (you'll need to declare it outside your loop too).

You should also:

  • use strict; use warnings;
  • Avoid reading <> into $_ when it's unnecessary. If you do need to slurp the whole file, prefer my $block = do { local $/; <> }; and then match with $block =~ m/regex/.
  • Use lexical file handles: open ( my $output, '>', 'filename' ) or die $!;
  • check your return code on open (or die $! is usually sufficient).

So that would be something like:

#!/usr/bin/perl

use warnings;
use strict;

my %seen; 
my $count = 0; 
my $output; 

while ( <> ) {
  next if ( m/spectrum=(\d+)/ and $seen{$1}++ );
  if ( m/=Cluster=/ ) { 
     open ( $output, ">", "temp".$count++ ) or die $!; 
     select $output;
     %seen = ();   # reset so duplicates are only removed within a cluster
  }
  print;
}
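A Python sketch of the same approach, with the seen-set reset at each =Cluster= marker so duplicates are only removed within a cluster (an assumption matching the question's requirement; the function name is illustrative):

```python
import re

def dedupe_per_cluster(lines):
    """Remove lines whose spectrum number has already appeared in the
    current cluster; the seen-set is reset at every =Cluster= marker."""
    seen = set()
    out = []
    for line in lines:
        if "=Cluster=" in line:
            seen = set()          # new cluster: forget earlier spectra
        else:
            m = re.search(r"spectrum=(\d+)", line)
            if m and m.group(1) in seen:
                continue          # duplicate within this cluster
            if m:
                seen.add(m.group(1))
        out.append(line)
    return out
```

Resetting per cluster means the same spectrum number may legitimately appear in two different clusters without being removed from the second one.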

Upvotes: 1

Toto

Reputation: 91385

If duplicate lines are consecutive, you could use this Perl one-liner:

perl -ani.back -e 'next if defined($p) && $_ eq $p;$p=$_;print' file.txt 

The original file is backed up with the extension .back.

Upvotes: 0
