user3741035
user3741035

Reputation: 2545

match string between columns using perl

I want to compare a string in column A with that in column B for every row and print a third column that highlights the differences.

Column A                      Column B
uuaaugcuaauugugauaggggu       uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguu      uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguuu     uuaaugcuaauugugauaggggu

Desired Result:

Column A                      Column B                Column C
uuaaugcuaauugugauaggggu       uuaaugcuaauugugauaggggu ********************
uuaaugcuaauugugauagggguu      uuaaugcuaauugugauaggggu ********************u
uuaaugcuaauugugauagggguuu     uuaaugcuaauugugauaggggu ********************uu

I have an example script that might work, but how do I do this for every row in the data frame?

use strict;
use warnings;
my $string1 = 'AAABBBBBCCCCCDDDDD';
my $string2 = 'AEABBBBBCCECCDDDDD';
my $result = '';
for(0 .. length($string1)) {
    my $char = substr($string2, $_, 1);
    if($char ne substr($string1, $_, 1)) {
        $result .= "**$char**";
    } else {
        $result .= $char;
    }
}
print $result;

Upvotes: 0

Views: 201

Answers (2)

jjpcondor
jjpcondor

Reputation: 1416

I could not resist to provide a modified Miller's solution with regular expressions

   use strict;
   use warnings;

   while (<DATA>) {
    my $masked_str1 ="";
    my $masked_str2 ="";
    my ($str1, $str2) = split;

    my $intersection = $str1 ^ $str2;
    while ($intersection =~ /(\x00+)/g) {

        my $mask = $intersection;
        $mask =~ s/\x00/1/g;
        $mask =~ s/[^1]/0/g;

        while ( $mask =~ /\G(.)/gc ) { # traverse the mask
           my $bit = $1;
           if ( $str1 =~ /\G(.)/gc ) { # traverse the string1 to be masked
                $masked_str1 .= $bit ? '_' : $1;
           }
           if ( $str2 =~ /\G(.)/gc ) { # traverse the string2 to be masked
                $masked_str2 .= $bit ? '_' : $1;
           }
        }

    }
    print "=" x 80;
    printf "\n%-30s %s\n", $str2, $str1; # Minimum length 30 char, left-justified
    printf "%-30s %s\n", $str1, $str2;  
    printf "%-30s %s\n\n", $masked_str1, $masked_str2;  


}

Upvotes: 0

Miller
Miller

Reputation: 35198

Using bruteforce and substr

use strict;
use warnings;

while (<DATA>) {
    my ($str1, $str2) = split;
    my $len = length $str1 < length $str2 ? length $str1 : length $str2;
    for my $i (0..$len-1) {
        my $c1 = substr $str1, $i, 1;
        my $c2 = substr $str2, $i, 1;
        if ($c1 eq $c2) {
            substr $str1, $i, 1, '*';
            substr $str2, $i, 1, '*';
        }
    }
    printf "%-30s %s\n", $str1, $str2;
}

__DATA__
Column_A                      Column_B
uuaaugcuaauugugauaggggu       uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguu      uuaaugcuaauugugauaggggu
uuaaugcuaauugugauagggguuu     uuaaugcuaauugugauaggggu
AAABBBBBCCCCCDDDDD            AEABBBBBCCECCDDDDD

Outputs:

*******A                       *******B
***********************        ***********************
***********************u       ***********************
***********************uu      ***********************
*A********C*******             *E********E*******

Alternative using XOR

It's also possible to use ^ to find the intersection between two strings.

The following performs the same as the above:

while (<DATA>) {
    my ($str1, $str2) = split;

    my $intersection = $str1 ^ $str2;
    while ($intersection =~ /(\0+)/g) {
        my $len = length $1;
        my $pos = pos($intersection) - $len;
        substr $str1, $pos, $len, '*' x $len;
        substr $str2, $pos, $len, '*' x $len;
    }

    printf "%-30s %s\n", $str1, $str2;
}

Upvotes: 2

Related Questions