Reputation: 189
I modified the code to work with two files. to_search.txt contains the strings to search for. big_file.fastq contains the lines to be searched; if a string is found (up to 2 mismatches allowed, at the string's exact length, which ranges from 8-10 characters, with no insertions or deletions), the record is written to the output file named after that string. So each string is searched for in the second line of every 4-line record in big_file.fastq.
# to_search.txt: (length can be from 8-20 characters)
1 TCCCTTGT
2 ACGAGACT
3 GCTGTACG
4 ATCACCAG
5 TGGTCAAC
6 ATCGCACA
7 GTCGTGTA
8 AGCGGAGG
9 ATCCTTTG
10 TACAGCGC
# 2000 searches needed
# big_file.fastq: 2 billion lines (lines are grouped in fours; the string search runs on the second line of each group).
# Second line can have 100-200 characters
@M04398:19:000000000-APDK3:1:1101:21860:1000 1:N:0:1
TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAGCCTCCAACCAAGCACCCCATCTGTTCCAAATCTTCTCCCACTGCTACTTGAAGACGCTGAAGTTGAAGGGCCACCTTCATCATTCTGG
+
#8ACCDGGGGGGGGGGGGGEGGGGGGGGDFFEGGG@FFGGGGGGGGGGGGGGGGGCF@<FFGGGGGFE9FGGGFEACE8EFFFGGGGGF9F?CECEFEG,CFGF,CF@CCC?BFFG<,9<9AEFG,,
@M04398:19:000000000-APDK3:1:1101:13382:1000 1:N:0:1
NTCGATCGATCGATCGATCGATCGATCGTTCTGAGAGGTACCAACCAAGCACACCACGGGCGACACAGACAGCTCCGTGTTGAACGGGTTGTTCTTCTTCTTGCCTTCATCATCCCCATCCTCAGTGGACGCAGCTTGCTCATCCTTCCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@M04398:19:000000000-APDK3:1:1101:18888:1000 1:N:0:1
NCAGAATGAGGAAGGATGAGCCCCGTCGTGTCGAAGCTATTGACACAGCGCTATTCCGTCTTTATGTTCACTTTAAGCGGTACAAGGAGCTGCTTGTTCTGATTCAGGAACCGAACCCTGGTGGTGTGCTTGGTTGGCAAGTTTACGGCTC
+
#8BCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGE
Here is the code for two mismatches. With exact matching the speed is not bad: it takes around a day (estimated with the Time::Progress module). With 2 mismatches it shows 115 days to finish. How can the speed be improved here?
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;

open( my $in_p1, '<', 'big_file.fastq' ) or die "File not found";

# Read sample names and barcodes from the barcode file.
my ( @sample_file_names, @barcode1 );
open( my $bc_file, '<', 'to_search.txt' ) or die "No barcode file";
while ( my $line = <$bc_file> ) {
    $line =~ s/[\r\n]+$//;
    # The capturing parentheses put the separator into $elements[1],
    # so the barcode ends up in $elements[2].
    my @elements = split( /(\t|,| )/, $line );
    push @sample_file_names, $elements[0];
    push @barcode1,          $elements[2];
}
close $bc_file;

# Open one output filehandle per sample.
my @fh_array_R1;
foreach my $name (@sample_file_names) {
    open( my $out_r1, '>', "$name.fq" ) or die "cannot write file";
    push @fh_array_R1, $out_r1;
}

# File for reads that match no barcode.
open( my $unknown_r1, '>', 'unknown-barcode_SE.fq' )
  or die "cannot create unknown-r1 file";

# Build the two-mismatch regex for each barcode once, up front,
# instead of rebuilding it for every record.
my @barcode1_regex = map { qr/$_/i } map { make_barcode_fragments($_) } @barcode1;

while ( defined( my $p1_first_line = <$in_p1> ) ) {
    my $p1_second_line = <$in_p1>;
    my $p1_third_line  = <$in_p1>;
    my $p1_fourth_line = <$in_p1>;
    chomp( $p1_first_line, $p1_second_line, $p1_third_line, $p1_fourth_line );
    my $matched_R1 = "$p1_first_line\n$p1_second_line\n$p1_third_line\n$p1_fourth_line\n";

    my $matched = 0;
    for my $j ( 0 .. $#barcode1 ) {
        if ( $p1_second_line =~ $barcode1_regex[$j] ) {
            print { $fh_array_R1[$j] } $matched_R1;    # keep if matched
            $matched = 1;
            last;
        }
    }
    # Write to the unknown file only after *all* barcodes have failed,
    # not once per non-matching barcode.
    print $unknown_r1 $matched_R1 unless $matched;
}

# Make the two-mismatch pattern for a barcode: every choice of two
# positions is wildcarded with \w, giving n*(n-1)/2 alternatives.
sub make_barcode_fragments {
    my ($in1) = @_;
    my @subpats;
    for my $i ( 0 .. length($in1) - 1 ) {
        for my $j ( $i + 1 .. length($in1) - 1 ) {
            push @subpats, join( '',
                substr( $in1, 0, $i ),
                '\\w', substr( $in1, $i + 1, $j - $i - 1 ),
                '\\w', substr( $in1, $j + 1 ),
            );
        }
    }
    return join( '|', @subpats );
}
exit;
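For illustration, here is the alternation that make_barcode_fragments produces for a short 4-character barcode (a length-n barcode expands to n*(n-1)/2 alternatives, so a 20-mer already gives 190 branches, and each read line is matched against up to 2000 such patterns):

print make_barcode_fragments('ACGT'), "\n";
# \w\wGT|\wC\wT|\wCG\w|A\w\wT|A\wG\w|AC\w\w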
Upvotes: 0
Views: 244
Reputation: 15310
I'd suggest creating a really bad hashing algorithm: something simple, cheap to update, and deliberately collision-prone, like the sum of the characters. Or maybe the sum of unique values (1-4) assigned to the four bases.
Compute the target sums, and also compute the maximum allowed variance. That is, if the objective is a match with at most two substitutions, what is the maximum possible difference in the sum? ((4-1) + (4-1) = 6.)
Then, for each "window" of text of the appropriate length in the target data file, compute a running score. (Add a character to the end, drop a character from the start, update the hash score.) If the score for a window is within the allowable range, you can do further investigation.
You might want to implement this as different passes, possibly even as different stages in a shell pipeline or script; the idea is that you might be able to parallelize parts of the search. (For instance, all the match strings of the same length could be searched by one process, since the hash windows are the same.)
Also, of course, it is beneficial that you can keep your early work if your program crashes in the later stages. And you can even have the early parts of the process running while you are still developing the end stages.
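A minimal Perl sketch of the rolling-sum filter described above (the base-value mapping and names are illustrative, and the final exact check reuses the XOR trick for Hamming distance):

use strict;
use warnings;

# Illustrative mapping of bases to the values 1-4 (any fixed mapping works).
my %val = ( A => 1, C => 2, G => 3, T => 4 );

sub seq_sum {
    my $s = 0;
    $s += $val{$_} // 0 for split //, uc shift;
    return $s;
}

my $pattern  = 'TCCCTTGT';
my $k        = length $pattern;
my $target   = seq_sum($pattern);
my $max_diff = 2 * ( 4 - 1 );    # two substitutions, each shifting the sum by at most 3

my $line = uc 'TCttTTGTGATCGATCGATCGATCGATCGGTCGTGTAG';

# Slide a window of length $k: add the incoming base, drop the outgoing one.
my $sum = seq_sum( substr( $line, 0, $k ) );
for my $i ( 0 .. length($line) - $k ) {
    if ( abs( $sum - $target ) <= $max_diff ) {
        # Cheap filter passed: confirm with a real Hamming-distance check.
        my $window = substr( $line, $i, $k );
        my $hd = ( $pattern ^ $window ) =~ tr/\0//c;
        print "candidate at $i: $window (hd=$hd)\n" if $hd <= 2;
    }
    last if $i + $k >= length $line;
    $sum += $val{ substr( $line, $i + $k, 1 ) } // 0;
    $sum -= $val{ substr( $line, $i,      1 ) } // 0;
}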
Upvotes: 0
Reputation: 40748
If your algorithm cannot be changed or improved in Perl itself, you can still get a speedup by writing the time-consuming parts in C. Here is an example using Inline::C:
use strict;
use warnings;
use Benchmark qw(timethese);
use Inline C => './check_line_c.c';

my $find   = "MATCH1";
my $search = "saasdadadadadasd";
my %sub_info = (
    c    => sub { check_line_c( $find, $search ) },
    perl => sub { check_line_perl( $find, $search ) },
);
timethese( 4_000_000, \%sub_info );

sub check_line_perl {
    my ( $find, $search ) = @_;
    my $max_distance = 2;
    for my $offset ( 0 .. length($search) - length($find) ) {
        my $substr = substr( $search, $offset, length($find) );
        my $hd = hd( $find, $substr );
        if ( $hd <= $max_distance ) {
            return ( $hd, $substr );
        }
    }
    return ( undef, undef );
}

# Hamming distance: XOR the strings and count the non-NUL bytes.
sub hd {
    return ( $_[0] ^ $_[1] ) =~ tr/\001-\377//;
}
where check_line_c.c is:
void check_line_c( char *find, char *search ) {
    Inline_Stack_Vars;
    int max_distance = 2;
    int flen = strlen(find);
    int last_ind = strlen(search) - flen;
    SV *dis   = &PL_sv_undef;
    SV *match = &PL_sv_undef;
    for ( int ind = 0; ind <= last_ind; ind++ )
    {
        /* Hamming distance of the current window */
        int count = 0;
        for ( int j = 0; j < flen; j++ )
        {
            if ( find[j] ^ search[ind+j] ) count++;
        }
        /* <= so that exactly two mismatches are accepted,
           matching the pure-Perl version */
        if ( count <= max_distance )
        {
            match = newSV(flen);
            sv_catpvn( match, search + ind, flen );
            dis = newSViv(count);
            break;
        }
    }
    /* Return ( $hd, $substr ) like the pure-Perl version */
    Inline_Stack_Reset;
    Inline_Stack_Push( sv_2mortal(dis) );
    Inline_Stack_Push( sv_2mortal(match) );
    Inline_Stack_Done;
}
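Since the C function pushes the same two values onto the stack that the pure-Perl version returns, a caller could use it like this (a sketch; $p1_second_line stands for the read line from the main script):

my ( $hd, $match ) = check_line_c( $find, $p1_second_line );
if ( defined $hd ) {
    print "matched '$match' with $hd mismatches\n";
}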
The output (on an Ubuntu laptop with an Intel Core i7-4702MQ CPU @ 2.20GHz) is:
Benchmark: timing 4000000 iterations of c, perl...
c: 2 wallclock secs ( 0.76 usr + 0.00 sys = 0.76 CPU) @ 5263157.89/s (n=4000000)
perl: 19 wallclock secs (18.30 usr + 0.00 sys = 18.30 CPU) @ 218579.23/s (n=4000000)
So this gives a 24-fold speedup for this case.
Upvotes: 1