Reputation: 87

uniq first field without discarding the content of duplicated lines

This is the first time I have faced this situation. I need to run uniq on just the first field, but without discarding any of the content of the duplicated lines. Take this example:

Input file

ENST000001.1    +   67208778    67210057
ENST000001.1    +   67208778    67210768
ENST000001.1    +   67208778    67208882
ENST000002.5    +   67208778    67213982
ENST000003.1    -   57463571    57463801
ENST000003.1    -   57476352    57476463
ENST000003.1    -   57476817    57476945

When I run uniq -w 12, only the first 12 characters of each line (the first field is exactly 12 characters long) are compared when checking for duplicates. The result looks like this:

ENST000001.1    +   67208778    67210057
ENST000002.5    +   67208778    67213982
ENST000003.1    -   57463571    57463801

The content of all the duplicated lines is discarded and only the first line of each group remains. What I am looking for is something like this:

ENST000001.1    +   67208778_67210057  67208778_67210768  67208778_67208882 
ENST000002.5    +   67208778_67213982
ENST000003.1    -   57463571_57463801  57476352_57476463  57476817_57476945

How can I get uniq-like behaviour without losing the contents of the duplicated lines? Is there a way to do it in AWK, sed, or Perl?

Upvotes: 1

Answers (4)

potong

Reputation: 58578

This might work for you (GNU sed):

sed -r ':a;$!N;s/^((\S+\s+\S+).*)\n\2/\1/;ta;s/\<([0-9]+)\s+([0-9]+)\>/\1_\2/g;P;D' file

Upvotes: 0

Vijay

Reputation: 67319

awk '{a[$1" "$2]=a[$1" "$2]" "$3" "$4;}END{for(i in a)print i,a[i]}' your_file

Tested below:

> cat temp
ENST000001.1    +       67208778        67210057
ENST000001.1    +       67208778        67210768
ENST000001.1    +       67208778        67208882
ENST000002.5    +       67208778        67213982
ENST000003.1    -       57463571        57463801
ENST000003.1    -       57476352        57476463
ENST000003.1    -       57476817        57476945
> awk '{a[$1" "$2]=a[$1" "$2]" "$3" "$4;}END{for(i in a)print i,a[i]}' temp
ENST000002.5 +  67208778 67213982
ENST000003.1 -  57463571 57463801 57476352 57476463 57476817 57476945
ENST000001.1 +  67208778 67210057 67208778 67210768 67208778 67208882

If you specifically need the underscore (_) between the two coordinates, use this instead:

> awk '{a[$1" "$2]=a[$1" "$2]" "$3"_"$4;}END{for(i in a)print i,a[i]}' temp
ENST000002.5 +  67208778_67213982
ENST000003.1 -  57463571_57463801 57476352_57476463 57476817_57476945
ENST000001.1 +  67208778_67210057 67208778_67210768 67208778_67208882
>

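Explanation:

-> Create an associative array a whose key is the first field + a space + the second field.

-> The value for each key is its previous value + a space + the third field + an underscore + the fourth field.

-> The END block is executed after all the lines have been processed. Its for loop iterates over the associative array and prints each key with its value. Note that for (i in a) visits the keys in an unspecified order, which is why the output above is not in input order.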
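If the groups should come out in the order they first appear in the input, one option is to record each key the first time it is seen and replay that list in the END block. The following is only a minimal sketch of that idea, not part of the tested commands above: the function name group_ranges and the array names ranges and order are made up for illustration, and it assumes whitespace-separated input with four fields per line and tab-separated output.

```shell
#!/bin/sh
# group_ranges is a hypothetical helper name used only for this sketch.
# It groups lines on fields 1 and 2 and prints the groups in first-seen order.
group_ranges() {
  awk '
    {
      key = $1 "\t" $2
      if (!(key in ranges)) order[++n] = key    # record first appearance of this key
      ranges[key] = ranges[key] "\t" $3 "_" $4  # append start_end to the group
    }
    END {
      # replay the keys in the order they were first seen
      for (i = 1; i <= n; i++) print order[i] ranges[order[i]]
    }
  ' "$@"
}

# Demo on the sample input from the question.
group_ranges <<'EOF'
ENST000001.1    +   67208778    67210057
ENST000001.1    +   67208778    67210768
ENST000001.1    +   67208778    67208882
ENST000002.5    +   67208778    67213982
ENST000003.1    -   57463571    57463801
ENST000003.1    -   57476352    57476463
ENST000003.1    -   57476817    57476945
EOF
```

Because the keys are stored in a second, numerically indexed array, the input does not need to be sorted: lines with the same key are merged even when they are not adjacent, and the output order follows the first occurrence of each key.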

Since Perl is also tagged, here is a Perl solution:

perl -lane '$H{"$F[0] $F[1]"} .= " $F[2]_$F[3]"; END { print $_, $H{$_} for keys %H }' your_file

The Perl solution above runs directly on the command line. The -a switch already splits each line on whitespace into @F, so no -F pattern is needed.

Upvotes: 3

creaktive

Reputation: 5220

Here's a Perl one-liner:

perl -lane 'BEGIN{$"=v9}push@{$u{"@F[0,1]"}},"$F[2]_$F[3]"}{while(($k,$v)=each%u){print"@{[$k,@$v]}"}'

Expanded version:

#!/usr/bin/env perl
use strict;
use warnings;
BEGIN { $/ = "\n"; $\ = "\n"; $" = "\t" }
my %u;
while (<ARGV>) {
    chomp;
    my @F = split /\s+/;
    push @{$u{"@F[0, 1]"}}, "$F[2]_$F[3]";
}
while (my ($k, $v) = each %u) {
    print "@{[$k, @$v]}";
}

Upvotes: 0

simbabque

Reputation: 54381

In Perl, you can do it by grouping them in a hashref.

#!/usr/bin/perl
use strict;
use warnings;

my $lines;
while (<DATA>) {
  chomp;
  my @fields = split /\s+/;
  push @{ $lines->{"$fields[0] $fields[1]"} }, "$fields[2]_$fields[3]";
}

foreach my $line (sort keys %$lines) {
  print join("\t", $line, @{ $lines->{$line} }), "\n";
}
__DATA__
ENST000001.1    +   67208778    67210057
ENST000001.1    +   67208778    67210768
ENST000001.1    +   67208778    67208882
ENST000002.5    +   67208778    67213982
ENST000003.1    -   57463571    57463801
ENST000003.1    -   57476352    57476463
ENST000003.1    -   57476817    57476945

Upvotes: 1
