Reputation: 8727
I have a task to extract the inner html text from an html link with Perl.
Here is an example,
<a href="www.stackoverflow.com">Regex Question</a>
I want to extract the string: Regex Question
Note that the inner text might be empty, as below. In that case I want to get an empty string.
<a href="www.stackoverflow.com"></a>
and the inner text might be enclosed with multiple tags like this.
<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>
I have been trying to write a Perl regex for a while, but with no success. In particular, I don't know how to deal with multiple tags.
Upvotes: 1
Views: 352
Reputation: 26667
How about something like
(?<=>)[^<>\/]*(?=<\/)
This will match the string: Regex Question
example: http://regex101.com/r/sG4bZ1/1
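To try this pattern from Perl, here is a minimal sketch using only core Perl. The helper name first_inner_text and the sample strings are illustrative assumptions, not part of the answer. Two things to keep in mind: the pattern never looks at the tag name, so it returns the first text run that sits directly before any closing tag, and it excludes '/' from the text, so link text containing a slash is not matched.

```perl
use strict;
use warnings;

# Pattern from the answer: a run of text preceded by '>' and followed by '</'.
my $re = qr{(?<=>)[^<>/]*(?=</)};

# Hypothetical helper: returns the first match, or undef when nothing matches.
sub first_inner_text {
    my ($html) = @_;
    return $html =~ /($re)/ ? $1 : undef;
}

my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
    '<a href="www.stackoverflow.com">A/B test</a>',   # slash in the text
);

for my $html (@samples) {
    my $text = first_inner_text($html);
    print defined $text ? "[$text]\n" : "[no match]\n";
}
```

The first three samples print [Regex Question], [] and [Regex Question]; the last one prints [no match] because of the excluded slash.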
Upvotes: 0
Reputation: 35198
Use an HTML Parser for parsing HTML.
I suggest you take a look at Mojo::DOM, and at Mojo::UserAgent if you need to download the content from the web.
The following will pull all the links with the href containing stackoverflow.com and display the text inside:
use strict;
use warnings;
use Mojo::DOM;
use Data::Dump;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $link ($dom->find('a[href*="stackoverflow.com"]')->each) {
    dd $link->all_text;
}
__DATA__
<html>
<body>
<a href="www.stackoverflow.com">Regex Question</a>
I want to extract the string: Regex Question
<a href="www.notme.com">Don't want this link</a>
Note that, the inner text might be empty like this. This example get an empty string.
<a href="www.stackoverflow.com"></a>
and the inner text might be enclosed with multiple tags like this.
<a href="www.stackoverflow.com"><b><h2>Regex Question with tags</h2></b></a>
</body>
</html>
Outputs:
"Regex Question"
""
"Regex Question with tags"
For a helpful 8-minute introductory video, check out Mojocast Episode 5.
Upvotes: 3
Reputation: 21666
Parsing HTML with a regex is a bad idea; you're not Chuck Norris. You can use the Mojo::DOM module, which makes this task very easy.
A sample:
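use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# Parse
my $dom = Mojo::DOM->new('<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>');
# Find the first link; all_text also collects text from nested tags,
# whereas text only returns text directly inside the <a> element
say $dom->at('a')->all_text;
# Find every link and print the text of each one
say for $dom->find('a')->map('all_text')->each;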
To install Mojo::DOM (it ships as part of the Mojolicious distribution), just run the command below:
$ cpan Mojo::DOM
Upvotes: 1
Reputation: 67968
<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*<\/a>
Try this and see the demo. Grab the capture group (group 1 holds the inner text) or the match.
http://regex101.com/r/sU3fA2/1
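A minimal sketch of using this pattern from Perl, with core Perl only. The helper name anchor_text and the sample strings are illustrative assumptions. The pattern expects the shape "tags, one run of text, tags", so a link that mixes plain text with nested tags (the last sample) does not match at all.

```perl
use strict;
use warnings;

# Pattern from the answer; capture group 1 holds the inner text.
my $re = qr{<a[^>]*>(?:<[^>]*>)*([^<>]*)(?:<[^>]*>)*</a>};

# Hypothetical helper: returns the captured text, or undef when nothing matches.
sub anchor_text {
    my ($html) = @_;
    my ($text) = $html =~ $re;
    return $text;
}

my @samples = (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
    '<a href="www.stackoverflow.com">Regex <b>Question</b></a>',   # mixed content
);

for my $html (@samples) {
    my $text = anchor_text($html);
    print defined $text ? "[$text]\n" : "[no match]\n";
}
```

This prints [Regex Question], [], [Regex Question] and [no match].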
Upvotes: 1
Reputation:
You should use an HTML parser, but this can probably be done with a regex.
This finds matching open/close A-tag pairs with no nested A-tags, and
allows other tags inside the content.
If you want the A-tag's content without any other tags, the regex will be slightly different (not shown).
Since you are using Perl, this might work.
# =~ /(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!\/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|<\/a\s*>)).)*)<\/a\s*>/
(?s)
<a # Begin A-tag, must (should) contain attrib/val's
(?>
\s+ # (?!\s) add this if you think malformed '<a >' could slip by
(?: " .*? " | ' .*? ' | [^>]*? )+
>
)
(?<! /> ) # Lookbehind, ensure this is not a closed A-tag '<a/>'
( # (1 start), Capture Content between open/close A-tags
(?: # Cluster, match content
(?! # Negative assertion
(?>
<a # Not Start A-tag
(?>
\s+
(?: " .*? " | ' .*? ' | [^>]*? )+
>
)
| </a \s* > # and Not End A-tag
)
)
. # Assert passed, consume a content character
)* # End Cluster, do 0 to many times
) # (1 end)
</a \s* > # End A-tag
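One way to get the text without the inner tags is to keep the regex as it is and strip the tags from capture group 1 in a second step. This is only a sketch under that assumption, not the "slightly different" regex mentioned above; the helper name link_text and the sample strings are made up for illustration. Like the regex itself, it needs at least one attribute after '<a'.

```perl
use strict;
use warnings;

# One-line form of the regex above; group 1 is the content between the A-tags.
my $a_tag = qr{(?s)<a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)(?<!/>)((?:(?!(?><a(?>\s+(?:".*?"|'.*?'|[^>]*?)+>)|</a\s*>)).)*)</a\s*>};

# Hypothetical helper: returns the link content with inner tags removed,
# or undef when no A-tag pair is found.
sub link_text {
    my ($html) = @_;
    return undef unless $html =~ $a_tag;
    my $content = $1;
    $content =~ s/<[^>]*>//g;    # second step: drop any tags left in the content
    return $content;
}

for my $html (
    '<a href="www.stackoverflow.com">Regex Question</a>',
    '<a href="www.stackoverflow.com"></a>',
    '<a href="www.stackoverflow.com"><b><h2>Regex Question</h2></b></a>',
) {
    my $text = link_text($html);
    print defined $text ? "[$text]\n" : "[no match]\n";
}
```

For these three samples it prints [Regex Question], [] and [Regex Question].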
Upvotes: 0