Katori

Reputation: 101

Narrowing Regex results

I'm creating a regex. This is my test dataset:

<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>

I'm trying to capture everything from "href=" through ".pdf". The following regex:

href=.*?\.pdf

captures the right data when the PDF link is alone on its line, but on the last line it matches the following instead:

href="test.html">test1</a><a href="testtime.pdf

I only want the match to run from the last "href" to ".pdf". I don't want the first "href" on the line or anything between it and the second "href". Is it possible to modify the regex to match this properly?

Thanks.

Upvotes: 0

Views: 50

Answers (2)

Dmitry Egorov

Reputation: 9650

Require the attribute value to start with a double quote, and don't allow the value itself to contain that quote:

href="[^"]*?\.pdf

Demo: https://regex101.com/r/UuRin3/1
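
To see the difference outside regex101, here is a minimal Python sketch. The choice of Python and the variable names are assumptions for illustration only; the question doesn't say which language is in use. It runs both patterns against the last line of the test data:

```python
import re

# The last line of the question's test data.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Original pattern: the lazy .*? keeps the match short on the right,
# but the match still begins at the first "href=" on the line.
lazy = re.findall(r'href=.*?\.pdf', line)

# Negated character class: the match cannot cross a double quote,
# so it cannot run from one attribute value into the next.
bounded = re.findall(r'href="[^"]*?\.pdf', line)

print(lazy)
print(bounded)
```

The lazy quantifier only affects where the match ends, not where it starts, which is why the first pattern still begins at the first "href".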

P.S.

Don't use Regex to parse HTML
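
As a rough illustration of the parser route, here is a sketch that uses Python's standard-library html.parser. Python is an assumption, and PdfLinkCollector is a hypothetical helper name, not something from the question:

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags ending in .pdf."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.pdf_links.append(value)


# The test data from the question.
html = """<a href="test.html">test1</a>
<a href="test.pdf">test2</a>
<a href="test.html">test1</a>
<a href="test.html">test1</a><a href="testtime.pdf">test2</a>"""

collector = PdfLinkCollector()
collector.feed(html)
pdf_links = collector.pdf_links
print(pdf_links)
```

Because the parser hands over each attribute value separately, there is no way for a match to spill from one tag into the next.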

Upvotes: 2

schroedingersKat

Reputation: 189

First of all, use capturing groups: they let you match the whole string but extract only part of it. For example, href="([^"]*\.pdf)" matches the string href="xxxx.pdf" but extracts only the xxxx.pdf part. Using [^"]* rather than .* keeps the match inside a single attribute value.

How you read the captured group depends on the language or tool you use to run the regex. Somehow I doubt it is HTML alone.
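
For example, in Python (an assumption, since the question doesn't name a language) reading a capturing group could look like the sketch below. It uses a quote-bounded pattern, href="([^"]*\.pdf)", so the group cannot span two attributes:

```python
import re

# The last line of the question's test data.
line = '<a href="test.html">test1</a><a href="testtime.pdf">test2</a>'

# Group 1 is the attribute value; [^"]* keeps it inside one pair of quotes.
pattern = re.compile(r'href="([^"]*\.pdf)"')

match = pattern.search(line)
full_match = match.group(0)  # the whole matched text, including href="..."
pdf_name = match.group(1)    # only the captured part

print(full_match)
print(pdf_name)
```

Other languages expose groups differently (for instance through a match array or a named accessor), but the idea of group 0 being the whole match and group 1 the first parenthesized part is common.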

Upvotes: 0
