Reputation: 87
This is the first time I have faced such a situation. I need to run uniq on just the first field, but without discarding any content from the duplicated lines. Take this example.
Input file:
ENST000001.1 + 67208778 67210057
ENST000001.1 + 67208778 67210768
ENST000001.1 + 67208778 67208882
ENST000002.5 + 67208778 67213982
ENST000003.1 - 57463571 57463801
ENST000003.1 - 57476352 57476463
ENST000003.1 - 57476817 57476945
When I run uniq -w 12, only the first field (which is exactly 12 characters) is checked for duplicates across lines. The result looks like this:
ENST000001.1 + 67208778 67210057
ENST000002.5 + 67208778 67213982
ENST000003.1 - 57463571 57463801
The content of the duplicated lines is discarded and only the first line remains. What I am looking for is something like this:
ENST000001.1 + 67208778_67210057 67208778_67210768 67208778_67208882
ENST000002.5 + 67208778_67213982
ENST000003.1 - 57463571_57463801 57476352_57476463 57476817_57476945
How do I use uniq without losing the contents of the duplicated lines? Is there a way to do it in awk/sed/perl?
Upvotes: 1
Views: 210
Reputation: 58381
This might work for you (GNU sed):
sed -r ':a;$!N;s/^((\S+\s+\S+).*)\n\2/\1/;ta;s/\<([0-9]+)\s+([0-9]+)\>/\1_\2/g;P;D' file
Roughly: it keeps appending the next line and, as long as that line begins with the same first two fields, deletes the newline and the repeated key; once no more lines match, it joins each pair of coordinates with an underscore and prints the merged line.
Upvotes: 0
Reputation: 67221
awk '{a[$1" "$2]=a[$1" "$2]" "$3" "$4;}END{for(i in a)print i,a[i]}' your_file
tested below:
> cat temp
ENST000001.1 + 67208778 67210057
ENST000001.1 + 67208778 67210768
ENST000001.1 + 67208778 67208882
ENST000002.5 + 67208778 67213982
ENST000003.1 - 57463571 57463801
ENST000003.1 - 57476352 57476463
ENST000003.1 - 57476817 57476945
> awk '{a[$1" "$2]=a[$1" "$2]" "$3" "$4;}END{for(i in a)print i,a[i]}' temp
ENST000002.5 + 67208778 67213982
ENST000003.1 - 57463571 57463801 57476352 57476463 57476817 57476945
ENST000001.1 + 67208778 67210057 67208778 67210768 67208778 67208882
If you specifically want the underscore (_) separator, use this instead:
> awk '{a[$1" "$2]=a[$1" "$2]" "$3"_"$4;}END{for(i in a)print i,a[i]}' temp
ENST000002.5 + 67208778_67213982
ENST000003.1 - 57463571_57463801 57476352_57476463 57476817_57476945
ENST000001.1 + 67208778_67210057 67208778_67210768 67208778_67208882
>
Explanation:
->Create an associative array a whose key is the first field, a space, and the second field.
->The value for each key is its previous value followed by a space, the third field, an underscore, and the fourth field.
->The END block runs after all lines have been processed; its for loop iterates over the associative array and prints each key with its value.
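For readability, here is the same program written out as an awk script. This is just a sketch of the one-liner above; merge.awk is a placeholder file name (run it as awk -f merge.awk temp):
# merge.awk -- same logic as the underscore one-liner above
{
    key = $1 " " $2                   # first field + space + second field
    a[key] = a[key] " " $3 "_" $4     # append "start_end" to the value stored so far
}
END {
    for (i in a)                      # note: for-in order is not the input order
        print i, a[i]
}
As with the one-liner, the output order depends on awk's hash traversal (as seen in the test run above), so pipe through sort if you need the keys in a fixed order.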
Since perl is also tagged, here is the perl solution:
perl -F -lane '$H{$F[0]." ".$F[1]}=$H{$F[0]." ".$F[1]}." ".$F[2]."_".$F[3];if(eof){foreach(keys %H){print $_,$H{$_}}}' your_file
The above perl solution runs directly on the command line as well.
Upvotes: 3
Reputation: 5210
Here's a Perl one-liner:
perl -lane 'BEGIN{$"=v9}push@{$u{"@F[0,1]"}},"$F[2]_$F[3]"}{while(($k,$v)=each%u){print"@{[$k,@$v]}"}'
Expanded version:
#!/usr/bin/env perl
use strict;
use warnings;
BEGIN { $/ = "\n"; $\ = "\n"; $" = "\t" }
my %u;
while (<ARGV>) {
    chomp;
    my @F = split /\s+/;
    push @{$u{"@F[0, 1]"}}, "$F[2]_$F[3]";
}
while (my ($k, $v) = each %u) {
    print "@{[$k, @$v]}";
}
Upvotes: 0
Reputation: 54323
In Perl, you can do it by grouping them in a hashref.
#!/usr/bin/perl
use strict;
use warnings;
my $lines;
while (<DATA>) {
    chomp;
    my @fields = split /\s+/;
    push @{ $lines->{"$fields[0] $fields[1]"} }, "$fields[2]_$fields[3]";
}
foreach my $line (sort keys %$lines) {
    print join("\t", $line, @{ $lines->{$line} }), "\n";
}
__DATA__
ENST000001.1 + 67208778 67210057
ENST000001.1 + 67208778 67210768
ENST000001.1 + 67208778 67208882
ENST000002.5 + 67208778 67213982
ENST000003.1 - 57463571 57463801
ENST000003.1 - 57476352 57476463
ENST000003.1 - 57476817 57476945
Upvotes: 1