Hobbit

Reputation: 35

Perl 5 regex match all non html-tags without variable length lookbehind

I am trying to extract all the text that is not part of an HTML tag from a web page. I was using RegExr, and one of its sample patterns worked perfectly for what I need. The only problem is that I am using Perl 5, and it keeps spitting out this error:

Variable length lookbehind not implemented in regex m/((?<=^|>)[^><]+?(?=<|$))/ at POODLE_calc.pl line 36.

I've read many other posts on here about this error but still can't get it to work! I've tried rewriting the pattern in as many different ways as I can think of or find on Google, and I tried \K as suggested in one of the Stack Overflow posts, but still nothing works.

This is the excerpt from the HTML page I was experimenting on in RegExr (the full page made it crash):

<TABLE border cellspacing="2">
    <TR align="center">
        <TD width="50"> no. </TD> 
        <TD width="50"> AA </TD> 
        <TD width="50"> ORD/DIS </TD> 
        <TD width="50"> Prob. </TD> 
    </TR>
    <tr align="center">
        <td> 1 </td>
        <td> M </td>
        <td> -1 </td>
        <td> 0.1029 </td>
    </tr>

If you could help me figure out a pattern that Perl will accept and that gives me "no. AA ORD/DIS Prob. 1 M -1 0.1029", I would greatly appreciate it!

Thanks,
Hobbit

EDIT

I used the pattern suggested by ikegami and it stopped the Perl error, but it only returns "no." and all of the space characters. Here is the code that does the parsing:

while (<FILE>){
    $_ =~ /((?:^|(?<=>))[^><]+?(?=<|$))/g;
    $proteinScores .= $1;
}
print $proteinScores."\n";

Upvotes: 0

Views: 83

Answers (2)

perreal

Reputation: 98048

This can help, assuming no text spans multiple lines and there is only one piece of text per line:

while (<DATA>){
    $proteinScores .= $1 if />([^>]+)</;
}

This one handles multiple pieces of text per line:

while (<DATA>){
    $proteinScores .= $1 while />([^>]+)</g;
}

And this one handles text that spans lines:

$text = join("", <DATA>);
$proteinScores .= $1 while $text =~ />([^<>]+)</g;
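For reference, here is a self-contained sketch of the spanning variant applied to the excerpt from the question. The inline `$html` string, the `@cells` array, and the whitespace trimming are assumptions added for illustration. They are not part of the original snippet, which reads from `<DATA>` and appends the raw captures.

```perl
use strict;
use warnings;

# Assumed input: the table excerpt from the question, inlined as a string.
my $html = <<'HTML';
<TABLE border cellspacing="2">
    <TR align="center">
        <TD width="50"> no. </TD>
        <TD width="50"> AA </TD>
        <TD width="50"> ORD/DIS </TD>
        <TD width="50"> Prob. </TD>
    </TR>
    <tr align="center">
        <td> 1 </td>
        <td> M </td>
        <td> -1 </td>
        <td> 0.1029 </td>
    </tr>
HTML

my @cells;
# Capture every run of characters between a '>' and the next '<'.
while ($html =~ />([^<>]+)</g) {
    my $text = $1;
    $text =~ s/^\s+|\s+$//g;             # trim surrounding whitespace
    push @cells, $text if length $text;  # skip whitespace-only gaps between tags
}

my $proteinScores = join ' ', @cells;
print "$proteinScores\n";   # prints: no. AA ORD/DIS Prob. 1 M -1 0.1029
```

Without the trimming step, the captures also include the newlines and indentation between tags. Note that this is a regex sketch rather than an HTML parser: it would misfire on markup where a literal `>` appears inside an attribute value.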

Upvotes: 1

ikegami

Reputation: 386331

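The error occurs because the two alternatives inside the lookbehind have different lengths: ^ matches zero characters and > matches one. (?<=^|>) could be written as (?:(?<=^)|(?<=>)), which simplifies to (?:^|(?<=>)), since looking behind for a zero-width assertion such as ^ is the same as asserting it directly.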
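Regarding the EDIT in the question: a match with /g in scalar (or void) context finds only one match per evaluation, so the loop there appends just the first match on each line, which on indented lines is likely the leading whitespace. A minimal sketch of one way around that, assuming a few sample lines modelled on the question's excerpt (the `@lines` and `@found` arrays and the trimming are illustrative additions):

```perl
use strict;
use warnings;

# Assumed sample lines, modelled on the excerpt in the question.
my @lines = (
    qq{    <TR align="center">\n},
    qq{        <TD width="50"> no. </TD> \n},
    qq{        <TD width="50"> AA </TD> \n},
    qq{        <td> 0.1029 </td>\n},
);

my @found;
for my $line (@lines) {
    # In list context, /g returns every capture on the line, not just the first.
    for my $match ($line =~ /((?:^|(?<=>))[^><]+?(?=<|$))/g) {
        (my $piece = $match) =~ s/^\s+|\s+$//g;  # trim a copy of the capture
        push @found, $piece if length $piece;    # drop indentation-only matches
    }
}

my $proteinScores = join ' ', @found;
print "$proteinScores\n";   # prints: no. AA 0.1029
```

The pattern itself is unchanged from the one above. Only the calling context and the whitespace handling differ.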

Upvotes: 1
