Reputation: 1442
This is my input file:
Region_1 Region_1276
Region_2 Region_47
Region_3 Region_50
Region_1 Region_14
Region_50 Region_1
Region_27 Region_4
Region_12 Region_4
And this is the output I want:
Region_1 Region_1276 Common_0
Region_2 Region_47 Common_1
Region_3 Region_50 Common_2
Region_1 Region_14 Common_0
Region_50 Region_1 Common_3
Region_27 Region_4 Common_4
Region_12 Region_4 Common_4
Goal: I want to add a new column with an unique ID for every line in my input. The problem is that sometimes column one has a duplicated value, in my example, "Region_1" in column1 has two lines. So, the ID must be the same for this two rows (Common_0 in the output). It happens the same with second column. If column 2 has a duplicated value, the ID must be the same (Common_4). I've been trying hard during the day but this is the 'closest' code that I've, and it is still far away from the solution:
awk -v c=0 '{a[$1]++}END{for(x in a){if(a[x]>1){for(i=1;i<=a[x];i++){print x"\tCommon_"c}}else{print x"\tCommon_"c};c++}}'
Output of my code:
Region_27 Common_0
Region_12 Common_1
Region_1 Common_2
Region_1 Common_2
Region_50 Common_3
Region_2 Common_4
Region_3 Common_5
It's not OK because it only evaluates column1 and also I don't get print column2.
Upvotes: 2
Views: 69
Reputation: 126722
In general this can't be done linearly and a graph needs to be built.
For instance, if I have
Region_A Region_B
Region_C Region_D
then on the face of it I have two independent records, and the solutions you have been given apply two distinct IDs. But if I then have
Region_C Region_B
then that unifies the two previous records and they need to be given the same ID
Here's a solution using the Graph
module. I've added the situation above to the end of your sample data to demonstrate that it works as expected. If you don't need the IDs assigned in the order of your input data, or if you are happy to have the outout in a different order from the input, then a simpler solution is possible
use strict;
use warnings 'all';
use Graph;
my @data = <DATA>;
chomp @data;
my $g = Graph::Undirected->new(vertices => \@data, );
my @vertices = $g->vertices;
for my $i ( 0 .. $#vertices ) {
for my $j ( $i+1 .. $#vertices ) {
my @ri = split ' ', $vertices[$i];
my @rj = split ' ', $vertices[$j];
if ( $ri[0] eq $rj[0] or $ri[1] eq $rj[1]) {
$g->add_edge($vertices[$i], $vertices[$j]);
}
}
}
my %group;
{
my $n = 0;
for my $set ( $g->connected_components ) {
$group{$_} = $n for @$set;
++$n;
}
}
my @ids;
my $n = 0;
for ( @data ) {
printf "%-16s%-16s%-16s\n", split, $ids[$group{$_}] //= 'common_' . $n++;
}
__DATA__
Region_1 Region_1276
Region_2 Region_47
Region_3 Region_50
Region_1 Region_14
Region_50 Region_1
Region_27 Region_4
Region_12 Region_4
Region_A Region_B
Region_C Region_D
Region_C Region_B
Region_1 Region_1276 common_0
Region_2 Region_47 common_1
Region_3 Region_50 common_2
Region_1 Region_14 common_0
Region_50 Region_1 common_3
Region_27 Region_4 common_4
Region_12 Region_4 common_4
Region_A Region_B common_5
Region_C Region_D common_5
Region_C Region_B common_5
Upvotes: 1
Reputation: 1256
If you put this into a.awk
{
if (a[$1])
print $0 a[$1]
else if (b[$2])
print $0 b[$2]
else {
print $0 " common_" c
a[$1] = b[$2] = " common_" c
c++
}
}
And you do
awk -f a.awk -v c=0 foo.txt | column -t
You will get the desired result
Region_1 Region_1276 common_0
Region_2 Region_47 common_1
Region_3 Region_50 common_2
Region_1 Region_14 common_0
Region_50 Region_1 common_3
Region_27 Region_4 common_4
Region_12 Region_4 common_4
Upvotes: 1
Reputation: 53478
With perl
I would tackle it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my %id_for;
my $count = 0;
while ( <> ) {
chomp;
my ( $R1, $R2 ) = split;
my $id = $id_for{$R1} // $id_for{$R2} // $count++;
$id_for{$R1} //= $id;
$id_for{$R2} //= $id;
print join ( "\t", $R1, $R2, "common_".$id_for{$R1} ), "\n";
}
This doesn't do quite what you want though, because I get:
Region_1 Region_1276 common_0
Region_2 Region_47 common_1
Region_3 Region_50 common_2
Region_1 Region_14 common_0
Region_50 Region_1 common_2
Region_27 Region_4 common_3
Region_12 Region_4 common_3
Because it's matching that region 1 against the single list. Does this mean you have two lists?
E.g.:
#!/usr/bin/env perl
use strict;
use warnings;
my %column_A;
my %column_B;
my $count = 0;
while ( <> ) {
chomp;
my ( $R1, $R2 ) = split;
my $id = $column_A{$R1} // $column_B{$R2} // $count++;
$column_A{$R1} //= $id;
$column_B{$R2} //= $id;
print join ( "\t", $R1, $R2, "Common_".$id ), "\n";
}
This creates a list that looks like your desired output:
Region_1 Region_1276 Common_0
Region_2 Region_47 Common_1
Region_3 Region_50 Common_2
Region_1 Region_14 Common_0
Region_50 Region_1 Common_3
Region_27 Region_4 Common_4
Region_12 Region_4 Common_4
Upvotes: 1