rollsch

Reputation: 2780

How to stop .+ at the first instance of a character and not the last with regular expressions in Perl?

I want to replace:

'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''

With:

='''<font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>'''=

Now my existing code is:

$html =~ s/\n(.+)<font size=\".+?\">(.+)<\/font>(.+)\n/\n=$1$2$3=\n/gm

However this ends up with this as the result:

=''' SUMMER/WINTER CONFIGURATION FILES</font>'''=

I can see what is happening: the pattern matches from <font size=" all the way to the end of <font color="blue">, which is not what I want. I want it to stop at the first instance of ", not the last. I thought that is what putting the ? there would do, but I've tried .+, .+?, .* and .*? with the same result each time.

Anyone got any ideas what I am doing wrong?

Upvotes: 5

Views: 2281

Answers (3)

Pedro Silva

Reputation: 4700

As Mark said, just use CPAN for this.

#!/usr/bin/env perl

use strict; use warnings;
use HTML::TreeBuilder;

my $s = q{<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>};

my $tree = HTML::TreeBuilder->new;
$tree->parse( $s );
$tree->eof;    # signal end of input so the tree is finalized
print $tree->find_by_attribute( color => 'blue' )->as_HTML;

# => <font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>

That said, a plain regex does work for your specific case:

#!/usr/bin/env perl

use strict; use warnings;

my $s = q{<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>};

print $s =~ m{
                 < .+? >
                 (.+)?
                 </.+? >                
             }mx;

# => <font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>

Upvotes: 4

Jon

Reputation: 16726

You could change .+ to [^"]+: instead of "match anything", it means "match anything that isn't a double quote".
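As a rough sketch of how that could look on the sample line from the question: the helper name strip_size_font is made up for illustration, and the pattern is anchored to a single line rather than to the surrounding newlines used in the original substitution.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): drops the outer
# <font size="..."> wrapper from one line and wraps the line in '='.
sub strip_size_font {
    my ($line) = @_;
    # [^"]+ cannot cross a double quote, so the attribute value
    # always ends at the first closing quote.
    $line =~ s{^(.*?)<font size="[^"]+">(.*)</font>(.*)$}{=$1$2$3=};
    return $line;
}

my $in = q{'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''};
print strip_size_font($in), "\n";
# => ='''<font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>'''=
```

The negated class does not depend on greediness at all, which makes it harder to get wrong than sprinkling ? over the pattern.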

Upvotes: 7

Mark Byers

Reputation: 838796

Write .+? in all places to make each match non-greedy.

$html =~ s/\n(.+?)<font size=\".+?\">(.+?)<\/font>(.+?)\n/\n=$1$2$3=\n/gm
                ^                ^      ^            ^
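To make the difference concrete, here is a small sketch comparing a greedy and a non-greedy capture on a shortened version of the sample markup. The variable names and the shortened string are only for illustration.

```perl
use strict;
use warnings;

my $s = q{<font size="3"><font color="blue">text</font></font>};

# Greedy: .+ first runs to the end of the string, then backs off
# until a quote can match, so it stops at the LAST double quote.
my ($greedy) = $s =~ /size="(.+)"/;

# Non-greedy: .+? grows one character at a time and stops as soon
# as a quote can match, so it stops at the FIRST double quote.
my ($lazy) = $s =~ /size="(.+?)"/;

print "greedy: $greedy\n";    # => greedy: 3"><font color="blue
print "lazy:   $lazy\n";      # => lazy:   3
```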

Also try to avoid using regular expressions to parse HTML. Use an HTML parser if possible.

Upvotes: 8
