Sriram

Reputation: 2999

RegEx to extract text between a HTML tag

I'm looking for a regular expression that extracts the text between HTML tags of different types.

For example:

<span>Span 1</span> - O/p: Span 1

<div onclick="callMe()">Span 2</div> - O/p: Span 2

<a href="#">HyperText</a> - O/p: HyperText

I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, from here, but it is not working.

Upvotes: 4

Views: 37459

Answers (4)

Ammy

Reputation: 379

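import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 is the tag name, group 2 is the text between the tags.
// [^>]* stays inside the opening tag and .+? stops at the first matching close tag.
Matcher matcher = Pattern.compile("<([a-zA-Z]+)[^>]*>(.+?)</\\1>")
    .matcher("<a href=\"#\">HyperText</a>");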

while (matcher.find())
{
    String matched = matcher.group(2);

    System.out.println(matched + " found at "
        + "\n"
        + "start at :- " + matcher.start()
        + "\n"
        + "end at :- " + matcher.end()
        + "\n");
}

Upvotes: 1

sp00m

Reputation: 48817

This should suit your needs:

<([a-zA-Z]+).*?>(.*?)</\\1>

The first group contains the tag name, the second one the text in between.
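To make the two groups concrete, here is a minimal sketch that runs the pattern above against one of the question's examples. The class name TagGroups and the helper tagAndText are made up for illustration; only java.util.regex is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagGroups {
    // Same pattern as above, written as a Java string literal (\1 becomes \\1).
    private static final Pattern TAG =
        Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: returns {tag name, inner text} for the first match, or null.
    static String[] tagAndText(String html) {
        Matcher m = TAG.matcher(html);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = tagAndText("<div onclick=\"callMe()\">Span 2</div>");
        System.out.println(parts[0]); // prints: div
        System.out.println(parts[1]); // prints: Span 2
    }
}
```

Note that the lazy .*? before > stops at the first > it can, so an attribute value that itself contains > would cut the opening tag short.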

Upvotes: 1

MikeM

Reputation: 13631

Your comment shows that you have neglected to escape the backslashes in your regex string.

And if you want to match lowercase letters, add a-z to the character classes or use Pattern.CASE_INSENSITIVE (or add (?i) to the beginning of the regex):

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
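Putting those points together, a rough sketch might look like the following. It assumes the inline flags (?is) in place of the Pattern constants, and the class name TagTextExtractor and method extract are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTextExtractor {
    // Backslashes are doubled because this is a Java string literal:
    // the regex \b is written "\\b" and the backreference \1 is written "\\1".
    // (?i) makes the match case-insensitive, (?s) lets . match newlines.
    private static final Pattern TAG = Pattern.compile(
        "(?is)<([a-z][a-z0-9]*)\\b[^>]*>(.*?)</\\1>");

    // Collects group 2 (the text between the tags) of every match.
    static List<String> extract(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            texts.add(m.group(2));
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<span>Span 1</span>\n"
            + "<div onclick=\"callMe()\">Span 2</div>\n"
            + "<a href=\"#\">HyperText</a>";
        System.out.println(extract(html)); // prints: [Span 1, Span 2, HyperText]
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which matters if extract is called in a loop.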

Upvotes: 10

frickskit

Reputation: 624

A very specific way:

(<span>|<a href="#">|<div onclick="callMe\(\)">)(.*)(</span>|</a>|</div>)

But yeah, this will only work for those 3 examples. For anything more general, you'll need to use an HTML parser.
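As a quick illustration of why a parser is the safer choice, this sketch shows a general tag regex (the one from the highest-voted answer in this thread) going wrong on nested tags of the same name. The class name NestedTagLimit and the helper firstText are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedTagLimit {
    private static final Pattern TAG = Pattern.compile(
        "<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");

    // Hypothetical helper: returns group 2 of the first match, or null if none.
    static String firstText(String html) {
        Matcher m = TAG.matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // The lazy (.*?) stops at the first </div>, which closes the inner div,
        // so the captured "content" of the outer div is truncated.
        System.out.println(firstText("<div><div>inner</div> tail</div>"));
        // prints: <div>inner
    }
}
```

A regex has no way to count how many tags are open, so this cannot be fixed by tweaking the pattern; an HTML parser tracks the nesting for you.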

Upvotes: -1
