Regex, extract a href attribute from HTML with special name

Question

Given, for example, a string like this:

Some Text.. ANYTHING .. Some Text SEARCHED_TEXT..

I need to extract the 'href' attribute value from an HTML link that contains some searched word, like 'SEARCHED_TEXT' in the example. Could you please advise how to do this correctly? I would not ask if I had not already spent a lot of time on it =)

I got this far, but unfortunately it does not work correctly:

String str = " Some Text.. ANYTHING .. Some Text SEARCHED_TEXT";
Pattern pattern = Pattern.compile("");
Matcher matcher = pattern.matcher(str);

while (matcher.find()) {
    System.out.println(matcher.group(0)); // matched whole string
    System.out.println(matcher.group(1)); // should be SEARCHED_HREF_TO_EXTRAC
}

I see that I need some negation after href="(.*?)" to accept all symbols except

to find the correct HREF, but I can't make it work :(

Pshemo · Accepted Answer

Don't use regex here: it is not the proper tool for handling nested structures like HTML/XML (at least not the regex flavor used in Java, which doesn't support recursion).
(More info: Can you provide some examples of why it is hard to parse XML and HTML with a regex?)

The proper tool is an HTML/XML parser. I would probably choose jsoup because of its simplicity and CSS query support.

So your code could look like:

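import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String html = " Some Text.. ANYTHING .. Some Text SEARCHED_TEXT";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a:contains(SEARCHED_TEXT)"); // :contains is case-insensitive
System.out.println(links.attr("href")); // href of the first matching link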

or, if you expect to find many links, iterate over the found Elements and get the href attribute from each of them:

for (Element link : links) {
    System.out.println(link.attr("href"));
}
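
If adding a parser dependency is really not an option, the negation the question asks about can be approximated with a negative lookahead that stops the link text from running past a closing </a>. This is only a rough sketch and not a replacement for a parser: the class name HrefRegexSketch, the method findHrefs, and the sample markup are made up for illustration, and the pattern assumes flat HTML with double-quoted href values, no '>' inside attribute values, and no nested or commented-out links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefRegexSketch {

    // Collects the href of every <a> element whose text contains searchedText.
    // Hypothetical helper: only meant for simple, flat markup (see assumptions above).
    static List<String> findHrefs(String html, String searchedText) {
        Pattern pattern = Pattern.compile(
                "<a\\b[^>]*?\\shref=\"([^\"]*)\"[^>]*>" // opening tag, group 1 = href value
                + "(?:(?!</a>).)*?"                      // link text, never crossing a closing </a>
                + Pattern.quote(searchedText),           // searched word, matched literally
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        List<String> hrefs = new ArrayList<>();
        Matcher matcher = pattern.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Assumed sample input; the markup in the question was not preserved.
        String html = "<a href=\"OTHER_HREF\">Some Text..</a> ANYTHING .. "
                + "<a href=\"SEARCHED_HREF_TO_EXTRACT\">Some Text SEARCHED_TEXT..</a>";
        System.out.println(findHrefs(html, "SEARCHED_TEXT")); // prints [SEARCHED_HREF_TO_EXTRACT]
    }
}
```

The (?:(?!</a>).)*? part is what keeps a match from starting in the first link and ending in the second one. Anything less regular than the sample (single-quoted attributes, nested tags, comments) will break it, which is exactly why the parser is the safer choice.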
