Reputation: 2999
I'm looking for a regular expression that extracts the text between HTML tags of different types.
For example:
<span>Span 1</span>
- O/p: Span 1
<div onclick="callMe()">Span 2</div>
- O/p: Span 2
<a href="#">HyperText</a>
- O/p: HyperText
I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, from here, but it is not working for me.
Upvotes: 4
Views: 37459
Reputation: 379
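import java.util.regex.Matcher;
import java.util.regex.Pattern;

// [^>]* stays inside the opening tag, and the lazy (.+?) stops at the first matching close tag.
Matcher matcher = Pattern.compile("<([a-zA-Z]+)[^>]*>(.+?)</\\1>")
        .matcher("<a href=\"#\">HyperText</a>");
while (matcher.find()) {
    // Group 1 is the tag name; group 2 is the text between the tags.
    String matched = matcher.group(2);
    System.out.println(matched + " found at "
            + "\n"
            + "start at :- " + matcher.start()
            + "\n"
            + "end at :- " + matcher.end()
            + "\n");
}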
Upvotes: 1
Reputation: 48817
This should suit your needs:
<([a-zA-Z]+).*?>(.*?)</\\1>
The first group contains the tag name; the second contains the text in between.
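As a rough illustration of how the two groups come out, here is a minimal sketch that runs the pattern above over the three examples from the question, joined into one string. The class name TagTextDemo and the helper extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextDemo {
    // The pattern from this answer, written as a Java string literal.
    private static final Pattern TAG = Pattern.compile("<([a-zA-Z]+).*?>(.*?)</\\1>");

    // Hypothetical helper: returns "tagName=text" for every element matched in the input.
    static List<String> extract(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            // group(1) is the tag name, group(2) is the text between the tags.
            found.add(m.group(1) + "=" + m.group(2));
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<span>Span 1</span>"
                + "<div onclick=\"callMe()\">Span 2</div>"
                + "<a href=\"#\">HyperText</a>";
        // prints [span=Span 1, div=Span 2, a=HyperText]
        System.out.println(extract(html));
    }
}
```

Because both quantifiers are lazy, each match stops at the first closing tag with the same name, so several elements on one line are reported separately.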
Upvotes: 1
Reputation: 13631
Your comment shows that you have neglected to escape the backslashes in your regex string.
And if you want to match lowercase letters, add a-z to the character classes or use Pattern.CASE_INSENSITIVE (or add (?i) to the beginning of the regex):
"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"
If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
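To make the effect of the two flags concrete, here is a small sketch that keeps the uppercase-only pattern from the question (with the backslashes escaped) and only varies the flags passed to Pattern.compile. The class name FlagDemo and the helper firstText are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagDemo {
    // The pattern quoted in the question, with backslashes escaped for a Java string.
    private static final String REGEX = "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>";

    // Hypothetical helper: text between the first matched tag pair, or null if nothing matches.
    static String firstText(String html, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(html);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String oneLine = "<div onclick=\"callMe()\">Span 2</div>";
        String twoLines = "<div>Span\n2</div>";

        // No flags: the lowercase tag name is not matched, so this prints null.
        System.out.println(firstText(oneLine, 0));
        // CASE_INSENSITIVE lets [A-Z] match "div": prints Span 2.
        System.out.println(firstText(oneLine, Pattern.CASE_INSENSITIVE));
        // Without DOTALL, "." cannot cross the newline: prints null.
        System.out.println(firstText(twoLines, Pattern.CASE_INSENSITIVE));
        // With DOTALL the content may span lines: prints "Span" and "2" on two lines.
        System.out.println(firstText(twoLines, Pattern.CASE_INSENSITIVE | Pattern.DOTALL));
    }
}
```

Combining flags with | is equivalent to putting (?is) at the start of the regex.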
Upvotes: 10
Reputation: 624
A very specific way:
(<span>|<a href="#">|<div onclick="callMe\(\)">)(.*)(</span>|</a>|</div>)
This only works for those 3 examples, though. For anything more general, you'll need to use an HTML parser.
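For the parser route, one possible sketch uses the HTML parser that ships with the JDK's Swing classes, so no extra library is needed. This is only one option among several, and it assumes the java.desktop module is available. The class name ParserDemo and the helper textRuns are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class ParserDemo {
    // Hypothetical helper: collects every non-empty run of text the parser reports,
    // regardless of which tag encloses it.
    static List<String> textRuns(String html) {
        List<String> runs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                String text = new String(data).trim();
                if (!text.isEmpty()) {
                    runs.add(text);
                }
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        String html = "<span>Span 1</span>"
                + "<div onclick=\"callMe()\">Span 2</div>"
                + "<a href=\"#\">HyperText</a>";
        for (String run : textRuns(html)) {
            System.out.println(run);
        }
    }
}
```

Unlike the regex approaches, this does not depend on the attributes or on how the tags are nested, since the parser hands over the text nodes directly.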
Upvotes: -1