Reputation: 389

awk unix - match regex - regex string size limit | ideas?

The following code works as a minimal example. It searches a text (later a large DNA file) for a pattern, allowing one mismatch.

awk 'BEGIN{print match("CTGGGTCATTAAATCGTTAGC...", /.ATC|A.TC|AA.C|AAT./)}'

I am also interested in the position where the regular expression is found, so the real awk command will be more complex, as it is solved here.

If I want to search with more mismatches and a longer string, I end up with a very long regex.

Example: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" with 3 mismatches allowed (each "." stands for one mismatch):

/
...AAAAAAAAAAAAAAAAAAAAAAAAAAA|
..A.AAAAAAAAAAAAAAAAAAAAAAAAAA|
..AA.AAAAAAAAAAAAAAAAAAAAAAAAA|
[and so on, 4060 alternatives in total]
/
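To make the size of that regex concrete, here is a minimal Python sketch that enumerates the alternatives for the 30-character example. The helper name `mismatch_alternatives` is made up for illustration; it only counts patterns with exactly 3 wildcard positions, as in the listing above.

```python
from itertools import combinations
from math import comb

def mismatch_alternatives(pattern, k):
    """Yield every variant of `pattern` with exactly k positions replaced by '.'."""
    for positions in combinations(range(len(pattern)), k):
        chars = list(pattern)
        for p in positions:
            chars[p] = "."
        yield "".join(chars)

pattern = "A" * 30
alternatives = list(mismatch_alternatives(pattern, 3))
regex_source = "|".join(alternatives)

# Number of alternatives, checked against the binomial coefficient C(30, 3)
print(len(alternatives), comb(len(pattern), 3))  # 4060 4060
# Total characters in the joined regex (30 per alternative plus separators)
print(len(regex_source))                         # 125859
```

So the joined regex alone is well over 100,000 characters, which is why passing it on the command line runs into trouble.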

The problem with my solution is:

A very long regex is not accepted by awk (the limit seems to be roughly 80,000 characters).
Error: "bash: /usr/bin/awk: Argument list too long"
Possible solution: SO-Link, but I can't find the solution there...

My question is:

Can I somehow still use the long regex expression?
- splitting the string and running the command multiple times could be a solution, but then I will get duplicated results.
Is there another way to approach this?
- ("agrep" will work, but not to find the positions)

Upvotes: 7

Answers (3)

dawg

Reputation: 104015

As Jonathan Leffler points out in the comments, your issue in the first case (bash: /usr/bin/awk: Argument list too long) comes from the shell, and you can solve it by putting your awk script in a file.

As he also points out, your fundamental approach is not optimal. Below are two alternatives.

Perl has many features that will aid you with this.

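You can apply the ^ XOR operator to two strings; the result contains \x00 at every position where the strings match and some other character where they don't. Slide a window the length of the pattern along the longer string, XOR each window against the pattern, count the non-\x00 characters, and keep the positions where that count stays within the maximum number of substitutions: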

use strict;
use warnings;
use 5.014;

my $seq = "CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT";
my $pat     = "AAAAAA";
my $max_subs = 3;

my $len_in  = length $seq;
my $len_pat = length $pat;
my %posn;

sub strDiffMaxDelta {
    my ( $s1, $s2, $maxDelta ) = @_;
    
    # XOR the strings to find the count of differences
    my $diffCount = () = ( $s1 ^ $s2 ) =~ /[^\x00]/g;
    return $diffCount <= $maxDelta;
}

for my $i ( 0 .. $len_in - $len_pat ) { 
    my $substr = substr $seq, $i, $len_pat;
    # save position if there is a match up to $max_subs substitutions
    $posn{$i} = $substr if strDiffMaxDelta( $pat, $substr, $max_subs );
}

say "$_ => $posn{$_}" for sort { $a <=> $b } keys %posn;

Running this prints:

6 => AATCCA
9 => CCAGAA
10 => CAGAAC
11 => AGAACG
13 => AACGCA
60 => CAATCA
61 => AATCAA
62 => ATCAAA
63 => TCAAAT

Substituting:

my $seq      = "AAATCGAAAAGCDFAAAACGT";
my $pat      = "AATC";
my $max_subs = 1;

Prints:

1 => AATC
8 => AAGC
15 => AAAC

It is also easy (in the same style as awk) to convert this to 'magic input' from either stdin or a file.
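For comparison, here is a rough Python sketch of the same sliding-window idea, with the input supplied as lines rather than hard-coded. The function names (`mismatches`, `fuzzy_positions`, `scan`) are hypothetical. `scan` accepts any iterable of lines, so `sys.stdin` or `fileinput.input()` could be passed in place of the list used here.

```python
def mismatches(a, b):
    """Count positions where two equal-length strings differ (XOR of code points != 0)."""
    return sum((ord(x) ^ ord(y)) != 0 for x, y in zip(a, b))

def fuzzy_positions(text, pat, max_subs):
    """Return (0-based offset, window) pairs that differ from pat in at most max_subs places."""
    hits = []
    for i in range(len(text) - len(pat) + 1):
        window = text[i:i + len(pat)]
        if mismatches(window, pat) <= max_subs:
            hits.append((i, window))
    return hits

def scan(lines, pat, max_subs):
    """Yield (line number, offset, window) for every hit in an iterable of lines."""
    for line_no, line in enumerate(lines, start=1):
        for offset, window in fuzzy_positions(line.rstrip("\n"), pat, max_subs):
            yield line_no, offset, window

# Demo on the second example; a file object would work the same way as this list.
for line_no, offset, window in scan(["AAATCGAAAAGCDFAAAACGT\n"], "AATC", 1):
    print(f"{offset} => {window}")
# 1 => AATC
# 8 => AAGC
# 15 => AAAC
```

Like the Perl version, this reports overlapping hits and only handles substitutions, not insertions or deletions.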

You can also write a similar approach in awk:

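echo "AAATCGAAAAGCDFAAAACGT" | awk -v mc=1 -v seq="AATC" '
{
    # mc = maximum number of mismatches, seq = pattern to look for
    for (i = 1; i <= length($1) - length(seq) + 1; i++) {
        cnt = 0
        # count the mismatching characters in the window starting at i
        for (j = 1; j <= length(seq); j++)
            if (substr($1, i+j-1, 1) != substr(seq, j, 1)) cnt++
        # print the 0-based offset so the output lines up with the Perl version
        if (cnt <= mc) print i-1 " => " substr($1, i, length(seq))
    }
}'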

Prints:

1 => AATC
8 => AAGC
15 => AAAC

And the same result with the longer example above. Since the input is moved to STDIN (or a file) and the regex does not need to be HUGE, this should get you started either with Perl or Awk.

(Be aware that the first character of a string is offset 1 in awk and offset 0 in Perl...)

Upvotes: 1

Wiktor Stribiżew

Reputation: 627086

Is there another way to approach this?

Looking for fuzzy matches is easy with Python. You just need to install the PyPI regex module by running the following in the terminal:

pip install regex # or pip3 install regex

and then create the Python script (named, say, script.py) like

#!/usr/bin/env python3
import regex
filepath = r'myfile.txt'
with open(filepath, 'r') as file:
    for line in file:
        for x in regex.finditer(r"(?:AATC){s<=1}", line):
            print(f'{x.start()}:{x.group()}')

Use the pattern you want. Here, (?:AATC){s<=1} matches the AATC character sequence while allowing at most one substitution in the match. You can also prepend (?e), as in (?e)(?:AATC){s<=1}, to make the engine attempt to find a better fit.

Run the script using python3 script.py.

If myfile.txt contains just one AAATCGAAAAGCDFAAAACGT line, the output is

1:AATC
8:AAGC
15:AAAC

meaning that there are three matches at positions 1 (AATC), 8 (AAGC) and 15 (AAAC).

To print only the matched values, print x.group() on its own instead of the x.start():x.group() pair in the Python script.

See an online Python demo:

import regex
line='AAATCGAAAAGCDFAAAACGT'
for x in regex.finditer(r"(?:AATC){s<=1}", line):
    print(f'{x.start()}:{x.group()}')

Upvotes: 0

Kaz

Reputation: 58617

The "Argument list too long" problem is not from Awk. You're running into the operating system's memory size limit on the argument material that can be passed to a child process. You're passing the Awk program to Awk as a very large command line argument.

Don't do that; put the code into a file, and run it with awk -f file, or make the file executable and put a #!/usr/bin/awk -f or similar hash-bang line at the top.

That said, it's probably not such a great idea to include your data in the program source code as a giant literal.
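If you want to see the limit on your own system, the following Python sketch queries it and shows that the argument list stays tiny once the program lives in a file. The helper names `arg_max_bytes` and `write_awk_program` are made up for illustration, the awk program is just a small sample, and the command is only built here, not executed.

```python
import os
import tempfile

def arg_max_bytes():
    """Return the system's ARG_MAX in bytes, or None if it cannot be queried."""
    try:
        value = os.sysconf("SC_ARG_MAX")
    except (AttributeError, ValueError, OSError):
        return None
    return value if value > 0 else None

def write_awk_program(source):
    """Write awk source to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".awk", delete=False) as fh:
        fh.write(source)
    return fh.name

# Sample program only; a real one could hold a regex of any length.
program = 'BEGIN { print match("AAATCGAAAAGCDFAAAACGT", /.ATC|A.TC|AA.C|AAT./) }\n'
path = write_awk_program(program)
command = ["awk", "-f", path]  # argv stays short however large the program is

print("ARG_MAX:", arg_max_bytes())
print("argv bytes:", sum(len(arg) + 1 for arg in command))
os.remove(path)  # clean up the temporary file
```

The reported value covers the arguments and the environment together, so the usable room for a single huge argument may be smaller than the number printed.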

Upvotes: 0
