BeginnerPro

Reputation: 85

Java Regex to get the text from HTML anchor (<a>...</a>) tags

I'm trying to get the text within a certain tag. So if I have:

<a href="http://something.com">Found</a>

I want to be able to retrieve the Found text.

I'm trying to do it using regex. I can do it if the <a href="http://something.com"> part stays the same, but it doesn't.

So far I have this:

Pattern titleFinder = Pattern.compile( ".*[a-zA-Z0-9 ]* ([a-zA-Z0-9 ]*)</a>.*" );

I think the last two parts, ([a-zA-Z0-9 ]*)</a>.*, are OK, but I don't know what to do for the first part.

Upvotes: 6

Views: 9112

Answers (2)

Tim Pietzcker

Reputation: 336468

As they said, don't use regex to parse HTML. If you are aware of the shortcomings, you might get away with it, though. Try

Pattern titleFinder = Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
Matcher regexMatcher = titleFinder.matcher(subjectString);
while (regexMatcher.find()) {
    // matched text: regexMatcher.group(1)
} 

will iterate over all matches in a string.

It won't handle nested <a> tags and ignores all the attributes inside the tag.
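For reference, here is a minimal self-contained sketch of the loop above. The class name AnchorTextExtractor, the method name extract, and the sample input are made up for illustration; only the pattern and flags come from the snippet above. The same caveats apply: it is not an HTML parser, and because the pattern starts with <a[^>]*> it would also match other tags whose names begin with "a".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class AnchorTextExtractor {
    // Opening <a ...> tag with any attributes, lazily captured body, closing </a>.
    // DOTALL lets the body span line breaks; CASE_INSENSITIVE also accepts <A ...>.
    private static final Pattern ANCHOR =
            Pattern.compile("<a[^>]*>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the text between each opening and closing anchor tag, in document order.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = ANCHOR.matcher(html);
        while (matcher.find()) {
            texts.add(matcher.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://something.com\">Found</a> and <A HREF='x'>Also found</A>";
        System.out.println(extract(html)); // prints [Found, Also found]
    }
}
```

Collecting the matches into a list keeps the regex in one place, so it is easy to swap for a real HTML parser later if the input turns out to be messier than expected.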

Upvotes: 6

user467871

Reputation:

str = str.replaceAll("</?a\\b[^>]*>", "");

Here is an online ideone demo.

Here is a similar topic: How to remove the tags only from a text?
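A rough sketch of the tag-stripping approach, assuming the anchor tags may carry attributes such as href and may be written in upper case. The class name AnchorTagStripper and the method name strip are invented for this example. Note that this removes the tags and keeps everything else in the string, so it only yields the link text by itself when the input is a single anchor element.

```java
// Hypothetical helper class, named here only for illustration.
final class AnchorTagStripper {
    // (?i) makes the match case-insensitive, \b stops "a" from matching
    // longer tag names such as <abbr>, and [^>]* skips any attributes.
    static String strip(String html) {
        // Strings are immutable, so the result must be returned or reassigned.
        return html.replaceAll("(?i)</?a\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://something.com\">Found</a>";
        System.out.println(strip(html)); // prints Found
    }
}
```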

Upvotes: 0
