Dipesh Chothwe

Reputation: 31

How to get only uniq column values from a line using linux or perl?

I have a file like the one below:

1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K

I want the output to look like this:

1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D

Sort commands don't work, since they operate on whole lines rather than on fields within a line, so I tried writing a Perl program:

use strict;
use Data::Dumper; 

my $file1 = <$ARGV[0]>;

open (IF2, "$file1") || die "Cannot open the file\n";

open (OUT, ">$file1.out") || die "Cannot open the out file\n";



my $k = 0;my $i=0;
my @line;
my $m;
my @line2;



while ( chomp($m = <IF2>) ) {   
    my $count = 0;
    @line2 = split(/\t/,$m);#<stdin>;
    my $l = length @line2;print $l;<stdin>;
    for (my $x = 0; $x < $l;$x++) {
        my $k = 0;
        for (my $y = 0;$y < $l; $y) {
             $i++;
    #       
            if ($count == 0) 
            {
                print OUT "\t$line2[$x]";
                $count++;
            }

            if ($count != 0 && $x != $y)
            {
                if ($line2[$x] eq $line2[$y])
                {
                     $k++;
                }

            }
        }
        if ($k == 0)
        {
            print OUT "\t$line2[$x]";
        }
    }
  print OUT "\n";   
  }



  print $i;

  close IF2;
  close OUT;

But it didn't work. Can someone help?

Upvotes: 3

Views: 361

Answers (2)

zdim

Reputation: 66964

Note: The input and output examples in the question have been edited; the data is now consistent with the title, which asks to remove all duplicates. (See the original question for how it looked.) I am leaving the answer to the original question as well, until we hear from the poster.


I would like to make a general remark first.

Your posted code is an honest attempt to write a C-style program in Perl. I recommend resisting that and learning how to use Perl instead. It does take a little investment of time and effort, but it goes quickly and pays off quickly. As you can see below, it makes many things incomparably easier.

Even so, there are real issues with the posted code (for one, the inner for loop never increments $y, so it never terminates, and length @line2 gives the length of the stringified element count, not the number of elements), but a full code review is out of scope here.

Edited question

We need to remove all duplicates from each line. One way: split the line on whitespace, then remove duplicates from the resulting list, which is a standard task with ready-made solutions.

use warnings;
use strict;
use List::MoreUtils qw(uniq);

my $file = '...';
my $fileout = '...';

open my $fh,     '<', $file    or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";

while (<$fh>) 
{
    my @unique = uniq split;
    print $fh_out "@unique\n";
}
close $fh;
close $fh_out;

The line with spaces is restored by printing the quoted array, whereby spaces (or whatever is in $") are inserted between elements. The other option is to join the resulting list:

my $res = join ' ', uniq split;
print $fh_out $res, "\n";

or just   print $fh_out join(' ', uniq split), "\n";.

This uses uniq from the List::MoreUtils module. Note from uniq's documentation that

The order of elements in the returned list is the same as in LIST.

Since List::MoreUtils has come up, take note of the core module List::Util as well; from version 1.45 it also provides uniq.

A one-liner version

perl -MList::MoreUtils=uniq -lne'print join " ", uniq split' input > output

or

perl -MList::MoreUtils=uniq -lane'print join " ", uniq @F' input > output

See Command switches in perlrun


Original question (see it in the edit history)

I was answering the question with the belief that this was the input

1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K

and this was the desired output

1 B C D 2 K B D 1 B C D 2 K B D 1 B C D 2 K B D

By your desired output you want to remove only adjacent duplicates (not make every field unique, as "uniq" in the title suggests).

For that you can use the ability of a regex to match repeated patterns via backreferences. First we need to strip all spaces, and we put them back at the end. For example

use warnings;
use strict;

my $file = '...';
my $fileout = '...';

open my $fh,     '<', $file    or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";

while (my $line = <$fh>) 
{
    $line =~ s/\s+//g;       # remove spaces (and the newline)
    $line =~ s/(.)\1+/$1/g;  # remove adjacent duplicates
    $line =~ s/(.)/$1 /g;    # restore spaces
    print $fh_out "$line\n";
}
close $fh;
close $fh_out;

The . matches any character, replace it with something more restrictive if needed (for example \w, for a "word" character). See perlretut. Note that we cannot restore space along with replacement (like s/(.)\1+/$1 /g) since non-repeating chars aren't matched and don't get the space back.

This can be done in more concise ways.


A one-liner version

perl -lpe's/\s+//g; s/(.)\1+/$1/g; s/(.)/$1 /g' input > output

See Command switches in perlrun

Upvotes: 4

Borodin

Reputation: 126772

I suggest something like this. It finds all occurrences of whitespace followed by non-whitespace, and checks whether the non-whitespace part has been seen before on the current line. The matched substring is deleted if it has been seen before; otherwise it is left untouched.

use strict;
use warnings 'all';

while ( <DATA> ) {
    my %seen;
    s/(\s+(\S+))/ $seen{$2}++ ? '' : $1 /eg;
    print;
}

__DATA__
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K

output

1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D

This may be done in a one-liner like this

perl -pe 'my %s; s/(\s+(\S+))/ $s{$2}++ ? "" : $1 /eg' myfile

Upvotes: 2
