cucurbit

Reputation: 1442

Check duplicate values in two columns separately

This is my input file:

Region_1    Region_1276
Region_2    Region_47
Region_3    Region_50
Region_1    Region_14
Region_50   Region_1
Region_27   Region_4
Region_12   Region_4

And this is the output I want:

Region_1        Region_1276     Common_0
Region_2        Region_47       Common_1
Region_3        Region_50       Common_2
Region_1        Region_14       Common_0
Region_50       Region_1        Common_3
Region_27       Region_4        Common_4
Region_12       Region_4        Common_4

Goal: I want to add a new column with a unique ID for every line in my input. The problem is that sometimes column 1 has a duplicated value; in my example, "Region_1" appears on two lines in column 1, so the ID must be the same for those two rows (Common_0 in the output). The same applies to the second column: if column 2 has a duplicated value, the ID must be the same (Common_4). I've been trying hard all day, but this is the closest code I have, and it is still far from a solution:

 awk -v c=0 '{a[$1]++}END{for(x in a){if(a[x]>1){for(i=1;i<=a[x];i++){print x"\tCommon_"c}}else{print x"\tCommon_"c};c++}}'

Output of my code:

Region_27   Common_0
Region_12   Common_1
Region_1    Common_2
Region_1    Common_2
Region_50   Common_3
Region_2    Common_4
Region_3    Common_5

It's not right because it only evaluates column 1, and it doesn't print column 2 either.

Upvotes: 2

Views: 69

Answers (3)

Borodin

Reputation: 126722

In general this can't be done in a single linear pass; a graph needs to be built.

For instance, if I have

Region_A    Region_B
Region_C    Region_D

then on the face of it I have two independent records, and the solutions you have been given apply two distinct IDs. But if I then have

Region_C    Region_B

then that unifies the two previous records, and they need to be given the same ID.

Here's a solution using the Graph module. I've added the situation above to the end of your sample data to demonstrate that it works as expected. If you don't need the IDs assigned in the order of your input data, or if you are happy to have the output in a different order from the input, then a simpler solution is possible (a sketch follows the output below).

use strict;
use warnings 'all';

use Graph;

my @data = <DATA>;
chomp @data;

my $g = Graph::Undirected->new(vertices => \@data, );
my @vertices = $g->vertices;

# connect any two records that share a value in column 1 or column 2
for my $i ( 0 .. $#vertices ) {

    for my $j ( $i+1 .. $#vertices ) {

        my @ri = split ' ', $vertices[$i];
        my @rj = split ' ', $vertices[$j];

        if ( $ri[0] eq $rj[0] or $ri[1] eq $rj[1]) {
            $g->add_edge($vertices[$i], $vertices[$j]);
        }
    }
}

# number each connected component of the graph
my %group;
{
    my $n = 0;
    for my $set ( $g->connected_components ) {
        $group{$_} = $n for @$set;
        ++$n;
    }
}

# print the records in input order, assigning an ID to each component
# the first time it is encountered
my @ids;
my $n = 0;
for ( @data ) {
    printf "%-16s%-16s%-16s\n", split, $ids[$group{$_}] //= 'common_' . $n++;
}


__DATA__
Region_1    Region_1276
Region_2    Region_47
Region_3    Region_50
Region_1    Region_14
Region_50   Region_1
Region_27   Region_4
Region_12   Region_4
Region_A    Region_B
Region_C    Region_D
Region_C    Region_B

Output

Region_1        Region_1276     common_0        
Region_2        Region_47       common_1        
Region_3        Region_50       common_2        
Region_1        Region_14       common_0        
Region_50       Region_1        common_3        
Region_27       Region_4        common_4        
Region_12       Region_4        common_4        
Region_A        Region_B        common_5        
Region_C        Region_D        common_5        
Region_C        Region_B        common_5        
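
As noted above, if you don't need the IDs to follow the input order, the %group / @ids bookkeeping can be dropped and each connected component printed as a block. This is only a sketch of that simpler variant, built on the same graph idea and assuming all input lines are distinct (identical lines would collapse into a single vertex here):

use strict;
use warnings 'all';

use Graph::Undirected;

my @data = <DATA>;
chomp @data;

# one vertex per input line, an edge between any two lines that share
# a value in column 1 or in column 2 -- the same graph as above
my $g = Graph::Undirected->new(vertices => \@data);

for my $i ( 0 .. $#data ) {
    my @ri = split ' ', $data[$i];
    for my $j ( $i + 1 .. $#data ) {
        my @rj = split ' ', $data[$j];
        $g->add_edge($data[$i], $data[$j]) if $ri[0] eq $rj[0] or $ri[1] eq $rj[1];
    }
}

# print each connected component in turn, so the output is grouped by
# component instead of following the input order
my $n = 0;
for my $set ( $g->connected_components ) {
    my $id = 'common_' . $n++;
    for my $record ( @$set ) {
        printf "%-16s%-16s%-16s\n", ( split ' ', $record ), $id;
    }
}

__DATA__
Region_1    Region_1276
Region_2    Region_47
Region_3    Region_50
Region_1    Region_14
Region_50   Region_1
Region_27   Region_4
Region_12   Region_4
Region_A    Region_B
Region_C    Region_D
Region_C    Region_B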

Upvotes: 1

bkmoney

Reputation: 1256

If you put this into a.awk

{
    if (a[$1])                        # column 1 value seen before: reuse its ID
        print $0 a[$1]
    else if (b[$2])                   # column 2 value seen before: reuse its ID
        print $0 b[$2]
    else {                            # both values are new: assign a fresh ID
        print $0 " common_" c
        a[$1] = b[$2] = " common_" c  # remember the ID for both columns
        c++
    }
}

And you do

awk -f a.awk -v c=0 foo.txt | column -t

you will get the desired result:

Region_1   Region_1276  common_0
Region_2   Region_47    common_1
Region_3   Region_50    common_2
Region_1   Region_14    common_0
Region_50  Region_1     common_3
Region_27  Region_4     common_4
Region_12  Region_4     common_4

Upvotes: 1

Sobrique

Reputation: 53478

With Perl I would tackle it like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;

my %id_for;
my $count = 0; 

while ( <> ) {
   chomp; 
   my ( $R1, $R2 ) = split; 
   # reuse an ID already seen for either region, otherwise take the next number
   my $id = $id_for{$R1} // $id_for{$R2} // $count++;
   $id_for{$R1} //= $id; 
   $id_for{$R2} //= $id; 
   print join ( "\t", $R1, $R2, "common_".$id_for{$R1} ), "\n";

}

This doesn't do quite what you want though, because I get:

Region_1    Region_1276 common_0
Region_2    Region_47   common_1
Region_3    Region_50   common_2
Region_1    Region_14   common_0
Region_50   Region_1    common_2
Region_27   Region_4    common_3
Region_12   Region_4    common_3

That's because it is matching each region against a single combined list, regardless of which column it appeared in. Does this mean you have two separate lists?

E.g.:

#!/usr/bin/env perl
use strict;
use warnings;

# keep a separate lookup per column, so a value in column 1 never
# matches a value that has only appeared in column 2
my %column_A;
my %column_B;
my $count = 0; 

while ( <> ) {
   chomp; 
   my ( $R1, $R2 ) = split; 
   my $id = $column_A{$R1} // $column_B{$R2} // $count++; 
   $column_A{$R1} //= $id; 
   $column_B{$R2} //= $id; 
   print join ( "\t", $R1, $R2, "Common_".$id ), "\n";

}

This creates a list that looks like your desired output:

Region_1    Region_1276 Common_0
Region_2    Region_47   Common_1
Region_3    Region_50   Common_2
Region_1    Region_14   Common_0
Region_50   Region_1    Common_3
Region_27   Region_4    Common_4
Region_12   Region_4    Common_4

Upvotes: 1
