Reputation: 101
I'm creating a regex. This is my test dataset:
<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>
I'm trying to capture from "href=" to ".pdf", but the following regex:
href=.*?\.pdf
captures the right data when the link is alone on its line, but on the last line it also matches the following:
href="test.html">test1</a><a href="testtime.pdf
I only want the text from the last "href" to ".pdf". I don't want the first "href" on the line or anything that comes between it and the second "href". Is it possible to modify the regex to match this properly?
Thanks.
Upvotes: 0
Views: 50
Reputation: 9650
Require the attribute value to start with a quote, and do not let the value contain that quote:
href="[^"]*?\.pdf
Demo: https://regex101.com/r/UuRin3/1
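To see why the negated character class helps, here is a minimal Python sketch run against the last line of the question's dataset. Python's re module is an assumption here, since the question does not say which regex engine is in use, and the variable names are only illustrative.

```python
import re

# The last line of the test dataset from the question.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Lazy dot: the match starts at the FIRST href and expands until ".pdf" appears,
# so it runs across the first link into the second one.
lazy = re.findall(r'href=.*?\.pdf', line)

# Negated character class: the value may not contain a quote, so a match
# cannot cross the closing quote of the first href.
bounded = re.findall(r'href="[^"]*?\.pdf', line)

print(lazy)
print(bounded)
```

The lazy quantifier only minimizes how far the match extends to the right. It does not move the starting point, which is why the first pattern still begins at the first "href".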
Upvotes: 2
Reputation: 189
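First of all, use capturing groups: they let you match the whole string but extract only part of it. For example, href="([^"]*\.pdf)" matches the string href="xxxx.pdf" but extracts only the xxxx.pdf part. Note that a greedy .* inside the group would still run from the first "href" on the line to the last ".pdf", so the group excludes quotes instead.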
How you extract the group depends on the technology you use to run the regex. Somehow I doubt it is HTML itself.
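As one possible illustration, here is a short sketch in Python (an assumption, since the question does not name a language) that extracts only the file name through a capturing group. It uses a quote-bounded character class inside the group rather than .* so that a match cannot span two links. The names html and pdf_links are hypothetical.

```python
import re

# Test dataset from the question, with the repeated line omitted.
html = '''<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'''

# With exactly one capturing group, re.findall returns the group contents
# rather than the whole match, so only the attribute value is returned.
pdf_links = re.findall(r'href="([^"]*\.pdf)"', html)

print(pdf_links)
```

Other engines expose the group differently (for example as match group 1 or $1), so the extraction step would need to be adapted to whatever tool is actually in use.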
Upvotes: 0