turtle
turtle

Reputation: 75

Perl regex to capture group and stop matching

I need some help with this perl regular expression

s/.*?<\?lsmb if\s*?(\S*)\s*?\?>/$1/

in the code below parsing out some non-whitespace chars [A-Z][a-z][0-9][_] surrounded by any number of whitespace and the other chars. I have tried various Perl regular expressions which are all commented out in the program below.

My main problem I think is stopping matching at the end.

The code below runs 8 tests, and I am hoping to find something that passes all 8.

#!/usr/bin/perl

use strict;
use warnings;
use diagnostics;

my $count = 0;
my $t = 0;
#examples of things I need to match, match => catagory
my $self = {  'customerfax' => 'alpha',
             '_____' => 'Underscore',
             '000000' => 'numeric',
             'letter_reason_4' => 'alfa-numeric-underscore',
             'customerphone7' => 'alfa-numeric', 
             'customer_phone' => 'alfa-underscore',
           };
# must contain   <?lsmb 'varname from $self' ?> 
# may contain any amount of whitespace chars where one is depected
# will end with \n that is removed by chop below         
my $test1 = qq|<?lsmb if customerfax ?>  caacaacac\n|;
my $test2 = qq|<?lsmb if _____ ?> bbb\n|;
my $test3 = qq|<?lsmb if 000000 ?> cccc\n|;
my $test4 = qq|<?lsmb if letter_reason_4 ?><t \></'><><><>\n|;  # /
my $test5 = qq| <?lsmb if customerfax ?> |;
my $test6 = qq|<?lsmb if  customerphone7   ?> \<?lsmb ignore this >n|;
my $test7 = qq|<?lsmb if  customer_phone  ?>\n|;
my $test8 = qq| avcscc 34534534 <?lsmb if letter_reason_4 ?> 0xffff\n|;

strip_markup($test1);
strip_markup($test2);
strip_markup($test3);
strip_markup($test4);
strip_markup($test5);
strip_markup($test6);
strip_markup($test7);
strip_markup($test8);

if ($count == 8) { print "Passed All done\n";}
else { print "All done passed  $count out of 8 Try again \n"; }


sub strip_markup { 
    $_= shift;
    #print "strip_markup $_ \n";
    if (/<\?lsmb if /) {
        chop; # gets rid ot the new line
        #original
        #s/.*?<\?lsmb if (.+?) \?>/$1/;
        #What I have tried:
        #s/.*?<\?lsmb if(?:\s)*?(\S+?)(?:\s)*?\?>\b/$1/;
        s/.*?<\?lsmb if\s*?(\S*)\s*?\?>/$1/;
        #s/.*?<\?lsmb if\s*?([A-Za-z0-9_]*?)\s*?\?>/$1/;
        #s/.*?<\?lsmb if[\s]*?(\S*?)[\s]*?\?>/$1/;
        #s/.*?<\?lsmb if (\S*?) \?>/$1/;
        #s/.*?<\?lsmb if (\S+?) \?>/$1/;
        #s/.*?<\?lsmb if ([\S]+?)([\s]+?)\?>/$1/;
        #s/.*?<\?lsmb if[\s]+([\S]+)[\s]+\?>/$1/;
        #s/.*?<\?lsmb if\s*?([\S]*?)\s*?\?>/$1/;
        #s/.*?<\?lsmb if\s+?([\S]+?)[\s]+?\?>/$1/;
        #s/.*?<\?lsmb if ([\S]+?) \?>/$1/;
        #s/.*?<\?lsmb if\s*?([\S_]*?)\s*?\?>/$1/;
        #s/.*?<\?lsmb if\s*?([[a-zA-Z]|[\d]|[_]]*?)\s*?\?>/$1/;
        #s/.*?<\?lsmb if\s*?([a-zA-Z\d_]*?)\s*?\?>/$1/;
        #s/.*?<\?lsmb if\s*?([^[:space:]]+?)\s*?\?>/$1/;

        $t++;
        print "Test $t ";
        #look up the result as the hash key
        my $ok = $self->{$_};
        if ($ok) { 
                $count++;
                print "OK passed $ok,";
        }
        print qq|Test Value : '$_' \n|;
    }
}

Here are some of the Tests and what they should return:

Upvotes: 3

Views: 743

Answers (1)

zdim
zdim

Reputation: 66873

If my understanding of your requirements is right, the needed phrase is extracted by simple

my ($match) = $string =~ /<\?lsmb \s+ if \s+ (\w+)/x

In the list context the match operator m// returns a list with matches. Even if it's just one, we need the list context – in the scalar context its behavior is different. The list context comes from assigning to a list from it, my (...) =. The /x modifier merely allows us to use spaces inside, for readability. See perlretut for starters.

What may precede <? doesn't have to be specified, since the pattern matches anywhere in the string. The \w is for [A-Za-z0-9_] (see perlrecharclass), what seems to match your examples and description. The \S is more permissive. Nothing is needed after \w+.

Also, there is no need to first test whether the pattern is there

sub strip_markup 
{
    my ($test_res) = $_[0] =~ /<\?lsmb if (\w+)/;

    if ($test_res) {
        # ...
    }

    return $test_res;         # return something!
}

There is no reason for the substitution so we use a match.

I understand that you are working with code you can't change, but would still like to comment

  • No need to remove the newline here. But when you do that, use chomp and not chop

  • The sub uses global variables. That can lead to bugs. Declare in small scope. Pass

  • The sub modifies global variables. That often leads to bugs while there is rarely need for it

  • Use arrays for repetitions of the same thing

  • This can be organized differently, to separate work more clearly


For example

my @tests = (
    qq|<?lsmb if customerfax ?>  caacaacac\n|,
    # ...
);

my ($cnt, $t);

foreach my $test (@tests) 
{
    my $test_res = strip_markup($test);

    if (defined $test_res) {
        $t++;
        print "Test $t ";
        #look up the result as the hash key
        my $ok = $self->{$test_res};
        if ($ok) { 
                $count++;
                print "OK passed $ok,";
        }
        print qq|Test Value : '$_' \n|;
    }
    else { }  # report failure
}

sub strip_markup {
    my ($test_res) = $_[0] =~ /<\?lsmb \s+ if \s+ (\w+)/x;
    return $test_res;
}

The defined test of $test_res is to allow for falsey things (like 0 or '') to be valid results.

The reporting code can, and should be, in another subroutine.

Upvotes: 1

Related Questions