Reputation: 3466
I would like to extract text from an HTML page using regular expressions. Here is my code:
String regExp="<h3 class=\"field-content\"><a[^>]*>(\\w+)</a></h3>";
Pattern regExpMatcher=Pattern.compile(regExp,Pattern.UNICODE_CHARACTER_CLASS);
String example="<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3><h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
Matcher m=regExpMatcher.matcher(example);
while (m.find()) {
    System.out.println(m.group(1));
}
I would like to get the values Проба 1 and Проба 2. However, I only get the first value, Проба 1. What is my problem?
Upvotes: 1
Views: 648
Reputation: 89547
To discover the power of the dark side, you can try this pattern:
<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>
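The negated class [^<] matches any character other than <, so Cyrillic letters, digits, and spaces are all captured and this pattern needs no Unicode flag. (UNICODE_CASE only affects case-insensitive matching.)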
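A minimal sketch of this pattern applied to the example string from the question. The class name NegatedClassDemo and the helper extractTitles are illustrative and not part of the original post. The pattern is compiled without flags here, which is enough for this input because [^<] matches any character other than <.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Capture everything between <a ...> and </a> that is not a '<'.
    private static final Pattern LINK_TEXT = Pattern.compile(
            "<h3 class=\"field-content\"><a[^>]*>([^<]+)</a></h3>");

    // Hypothetical helper: collects group 1 of every match, in order.
    static List<String> extractTitles(String html) {
        List<String> titles = new ArrayList<>();
        Matcher m = LINK_TEXT.matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example =
                "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
              + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";
        // Both link texts are captured, including the space and the digit.
        System.out.println(extractTitles(example));
    }
}
```

Because [^<]+ stops at the first <, it also accepts punctuation inside the link text, but it will not match links whose text contains nested tags.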
Upvotes: 1
Reputation: 124215
It is blasphemy to use regex + HTML. But if you really want to be cursed then here it is (you have been warned):
String regExp = "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";
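The updated part is the character class [\\w\\s]. Since Проба 1 and Проба 2 also contain spaces, you need to add \\s to your pattern. Keep the Pattern.UNICODE_CHARACTER_CLASS flag from your code so that \\w still matches the Cyrillic letters.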
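As a rough sketch, the updated pattern can be dropped into the code from the question like this. The class name WordAndSpaceDemo and the helper extractTitles are made up for illustration. The helper takes the compile flags as a parameter only to show the difference the UNICODE_CHARACTER_CLASS flag makes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordAndSpaceDemo {
    // Same pattern as in the question, with \\s added to the captured class.
    static final String REG_EXP =
            "<h3 class=\"field-content\"><a[^>]*>([\\w\\s]+)</a></h3>";

    // Hypothetical helper: compiles REG_EXP with the given flags
    // and collects group 1 of every match, in order.
    static List<String> extractTitles(String html, int flags) {
        List<String> titles = new ArrayList<>();
        Matcher m = Pattern.compile(REG_EXP, flags).matcher(html);
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String example =
                "<h3 class=\"field-content\"><a href=\"/humana-akcija-na-kavadarechkite-navivachi-lozari\">Проба 1</a></h3>"
              + "<h3 class=\"field-content\"><a href=\"/opshtina-berovo-ne-mozhe-da-sostavi-sovet-0\">Проба 2</a></h3>";

        // With the flag, \w covers Unicode letters and digits: both titles match.
        System.out.println(extractTitles(example, Pattern.UNICODE_CHARACTER_CLASS));

        // Without the flag, \w is ASCII-only, so the Cyrillic text does not match.
        System.out.println(extractTitles(example, 0));
    }
}
```

Unlike a negated class, [\\w\\s]+ rejects link text that contains punctuation such as commas or hyphens, so it only suits titles made of letters, digits, underscores, and whitespace.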
Upvotes: 5