Reputation: 527
I need to get the value between the double quotes (") of an href attribute when it matches a specific pattern. I tried the pattern below, but I can't figure out what's wrong: when the pattern occurs several times on the same line, I get one huge group containing information that I don't want:
href="(/namehere/nane2here/(option1|option2).*)"
I need the group between the parentheses. The pattern occurs many times in the string, and all occurrences are on the same line.
Example of a string I'm trying to get the values from:
<div>adasdsda<div>...lots of tags here... <a ... href="/name/name/option1/data1/data2"...anything here ...">src</a>...others HTML text here...<a ... href="/name/name/option2/data1"...
Upvotes: 0
Views: 495
Reputation: 124275
First of all, don't use regex on an entire HTML structure. To learn why, visit:
Instead, try parsing the HTML into an object representing the DOM, which lets us easily traverse all elements and find the ones we are interested in.
One of the (IMO) easiest-to-use HTML parsers can be found at https://jsoup.org/. Its big plus is support for CSS selector syntax to find elements. The syntax is described at https://jsoup.org/cookbook/extracting-data/selector-syntax, where we can find:
[attr~=regex]: elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
In short, [attr~=regex] will let us find any element whose specified attribute has a value that the regex can match, even partially.
With this your code can look something like:
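import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String yourHTML =
        "<div>" +
        " <a href='abc/def/1'>foo</a>" +
        " <a href='abc/fed/2'>bar</a>" +
        " <a href='abc/ghi/3'>bam</a>" +
        "</div>";

Document doc = Jsoup.parse(yourHTML);
// select every <a> whose href contains a match for the regex
Elements elementsWithHref = doc.select("a[href~=^abc/(def|fed)]");
for (Element element : elementsWithHref) {
    String href = element.attr("href");
    System.out.println(href);
}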
Output:
abc/def/1
abc/fed/2
(notice that there is no abc/ghi/3, since ^abc/(def|fed) can't be found in it)
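To illustrate what "partially matched" means here, below is a minimal sketch that uses only java.util.regex instead of jsoup. It assumes the selector behaves like Matcher.find() (a match anywhere in the value) rather than Matcher.matches() (the whole value must match); the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PartialMatchDemo {
    // Hypothetical helper: keeps every value in which the regex can be found
    // somewhere, which is the "partial match" behaviour described above.
    static List<String> keepPartialMatches(String regex, List<String> values) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            // find() succeeds if any part of the value matches;
            // matches() would demand that the entire value matches.
            if (pattern.matcher(value).find()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hrefs = List.of("abc/def/1", "abc/fed/2", "abc/ghi/3");
        // prints [abc/def/1, abc/fed/2]
        System.out.println(keepPartialMatches("^abc/(def|fed)", hrefs));
    }
}
```

The regex only describes the start of the value, yet abc/def/1 and abc/fed/2 are still kept, because nothing requires the rest of the value to be covered by the pattern.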
Upvotes: 1
Reputation: 71
\b matches a word boundary:
href="(/namehere/nane2here/\\b(option1|option2)\\b[^"]*)"
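As a rough sketch of how such a pattern could be applied with java.util.regex: the alternatives are grouped as (?:option1|option2) so that | does not split the whole expression, and [^"]* is used instead of .* so that each match stops at the closing quote of its own href. The class name, the method name and the sample HTML are assumptions made for this example, and the /name/name/ prefix is taken from the sample string in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // [^"]* cannot run past the closing quote, whereas a greedy .* would
    // extend to the last double quote on the line and swallow other tags.
    private static final Pattern HREF = Pattern.compile(
            "href=\"(/name/name/(?:option1|option2)\\b[^\"]*)\"");

    // Hypothetical helper: returns group 1 of every match, in order.
    static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up single-line input modelled on the question's sample.
        String html = "<div><a id=\"x\" href=\"/name/name/option1/data1/data2\" class=\"c\">src</a>"
                + "text<a href=\"/name/name/option2/data1\">src</a>"
                + "<a href=\"/name/name/option3/data1\">src</a></div>";
        // prints [/name/name/option1/data1/data2, /name/name/option2/data1]
        System.out.println(extract(html));
    }
}
```

This only handles double-quoted href values written exactly as href="..."; for anything less regular, an HTML parser is the safer choice.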
Upvotes: 0
Reputation:
Try "(?si)<[\\w:]+(?=(?:[^>\"']|\"[^\"]*\"|'[^']*')*?(?<=\\s)href\\s*=\\s*(?:(['\"])\\s*((?:(?!\\1).)*?/namehere/nane2here/(?:option1|option2)(?:(?!\\1).)*)\\s*\\1))\\s+(?:\".*?\"|'.*?'|[^>]*?)+>"
feature :
Upvotes: 0