eawedat

Reputation: 417

Extract href values that end with .pdf

I want to extract direct links to PDF files from a web page. I tried this regex pattern, but it did not work:

href=.*\.pdf$

Test data:

<a class="btn btn-small pad-button" href="/Tests/English/english_2011_summer_A-Q_b.pdf">eng1</a><br>
<a href="english_2011_summer_A-Q_c.pdf">eng2</a>

Upvotes: 0

Views: 223

Answers (3)

depsai

Reputation: 415

Try the pattern below. Capturing group 1 holds the exact href value.

href="([^"]+\.pdf)"

Demo: http://regex101.com/r/nR8gY4/1
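
As a minimal sketch of how group 1 could be read in practice, assuming Python and its standard re module (the question does not say which language is used), the pattern can be applied to the test data from the question like this:

```python
import re

# Test data copied from the question.
html = '''<a class="btn btn-small pad-button" href="/Tests/English/english_2011_summer_A-Q_b.pdf">eng1</a><br>
<a href="english_2011_summer_A-Q_c.pdf">eng2</a>'''

# With exactly one capturing group, re.findall returns the text of
# group 1 for every match, i.e. the href value without the quotes.
links = re.findall(r'href="([^"]+\.pdf)"', html)

for link in links:
    print(link)
```

Note that this pattern only handles double-quoted attribute values.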

Upvotes: 0

hwnd

Reputation: 70722

The main problem is the end-of-string anchor $: the href values are not at the end of the string, so the pattern cannot match there. I would recommend using an HTML parser to extract these values, but if you want to use a regex, I propose something like the following.

href=(["'])([^"']+\.pdf)\1

The values you want can be accessed through capturing group 2.
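
Here is a rough sketch of both routes, assuming Python (the question does not name a language). The regex part uses the pattern above; the parser part uses html.parser from the standard library. The class name PdfLinkParser is made up for illustration, and the second test link has been switched to single quotes purely to exercise the \1 backreference.

```python
import re
from html.parser import HTMLParser

# Test data from the question; the second link uses single quotes
# here (an assumption) to show that both quote styles are handled.
html = """<a class="btn btn-small pad-button" href="/Tests/English/english_2011_summer_A-Q_b.pdf">eng1</a><br>
<a href='english_2011_summer_A-Q_c.pdf'>eng2</a>"""

# Regex route: group 1 is the opening quote, \1 demands the same
# closing quote, and group 2 holds the href value itself.
pattern = re.compile(r"""href=(["'])([^"']+\.pdf)\1""")
regex_links = [m.group(2) for m in pattern.finditer(html)]


# Parser route: collect href attributes of <a> tags ending in .pdf.
class PdfLinkParser(HTMLParser):  # hypothetical helper name
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".pdf"):
                self.links.append(value)


parser = PdfLinkParser()
parser.feed(html)
parser_links = parser.links

print(regex_links)
print(parser_links)
```

On this input both routes give the same two links. The parser route does not depend on attribute order or quoting style, which is the reason for preferring it on real pages.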

Upvotes: 3

Federico Piazza

Reputation: 30985

You can use this regex.

href=".*?([\w-]+\.pdf)"


The idea of this regex is to look for every href whose value ends with X.pdf. The capturing group holds only the file name, without the leading path.
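
To illustrate what the group captures, here is a small sketch assuming Python's re module (the language is my assumption, not part of the question), run against the question's test data:

```python
import re

# Test data copied from the question.
html = '''<a class="btn btn-small pad-button" href="/Tests/English/english_2011_summer_A-Q_b.pdf">eng1</a><br>
<a href="english_2011_summer_A-Q_c.pdf">eng2</a>'''

# The lazy .*? skips over any leading path, and the group then
# captures the run of word characters and hyphens before ".pdf".
names = re.findall(r'href=".*?([\w-]+\.pdf)"', html)

for name in names:
    print(name)
```

For the first link this yields english_2011_summer_A-Q_b.pdf and drops /Tests/English/, so this pattern fits when only the file name is needed. File names containing characters outside [\w-], such as spaces, would be captured only partially.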

Upvotes: 1
