user1447941

Reputation: 3895

Regex will not match

This is my string:

<link href="/post?page=4&amp;tags=example" rel="last" title="Last Page">

From there I am trying to obtain the 4 out of that page parameter, using this regular expression:

link href="/post?page=(.*?)&amp;tags=(.*?)" rel="last"

I will then read the 4 from the first group. The tags parameter uses a wildcard because its contents can change. However, I am not getting a match with this. Can anyone help?

And I know I shouldn't be using regex to parse HTML, but this is just a small thing and it would be a waste to import a huge module for this.

Upvotes: 1

Views: 70

Answers (4)

Niet the Dark Absol

Reputation: 324620

Assuming you are using a /regex literal/, you will need to escape the / in that path as \/.

Alternatively, it depends on how you are getting this string. Is it really typed that way, or is it part of an innerHTML that you are then reading out again? If that's the case, then the innerHTML won't be what you expect it to be, because the browser will "normalise" it.

If it is an innerHTML, then it'd be far easier to get the tag, then get the tag's href attribute, then regex that.
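As a rough sketch of the "get the tag, then its href" idea outside the browser: the thread does not say which language is in use, so Python and its standard-library html.parser are assumed here. The class name LastLinkFinder is hypothetical, and urllib.parse.parse_qs is swapped in for the final regex step.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

html = '<link href="/post?page=4&amp;tags=example" rel="last" title="Last Page">'


class LastLinkFinder(HTMLParser):
    """Hypothetical helper: remembers the href of a <link rel="last"> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "link" and attr_map.get("rel") == "last":
            # The parser has already decoded &amp; to & in attribute values.
            self.href = attr_map.get("href")


finder = LastLinkFinder()
finder.feed(html)

# Split the query string instead of running a regex over the href.
query = parse_qs(urlsplit(finder.href).query)
last_page = int(query["page"][0])
print(finder.href)
print(last_page)
```

Because the parser decodes entities, the href no longer contains &amp;, which is the same kind of normalisation the browser applies to innerHTML.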

Upvotes: 3

Smileek

Reputation: 2782

link href="/post\?page=(.*?)&amp;tags=(.*?)" rel="last"
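You forgot the backslash before the ?. Unescaped, ? is a quantifier that makes the preceding t optional, so the literal ? in the URL is never matched.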
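A quick way to see the difference, assuming Python's re module (the question does not name a language), is to run both patterns against the string from the question:

```python
import re

html = '<link href="/post?page=4&amp;tags=example" rel="last" title="Last Page">'

# Unescaped: "t?" means "an optional t", so the literal "?" in the URL
# has nothing in the pattern to match it.
broken = r'link href="/post?page=(.*?)&amp;tags=(.*?)" rel="last"'

# Escaped: "\?" matches a literal question mark.
fixed = r'link href="/post\?page=(.*?)&amp;tags=(.*?)" rel="last"'

broken_match = re.search(broken, html)
fixed_match = re.search(fixed, html)

print(broken_match)          # None
print(fixed_match.group(1))  # 4
print(fixed_match.group(2))  # example
```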

Upvotes: 1

Lady Serena Kitty

Reputation: 59

I think it would be better to change your capture groups so that each one matches everything up to its terminating character:

link href="/post\?page=([^&]+)&amp;tags=([^"]+)" rel="last"

Putting the negation character ^ first in a character class tells the regex engine to match any character EXCEPT the ones listed. This makes it easy to capture everything up to a terminating character, such as the ampersand or the double quote. Assuming you're using PHP or Java, this should also slightly improve regex performance.
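A minimal sketch of the negated-class version, assuming Python's re module and with the ? escaped as suggested in another answer:

```python
import re

html = '<link href="/post?page=4&amp;tags=example" rel="last" title="Last Page">'

# [^&]+ stops at the first ampersand; [^"]+ stops at the closing quote.
pattern = re.compile(r'link href="/post\?page=([^&]+)&amp;tags=([^"]+)" rel="last"')

match = pattern.search(html)
page, tags = match.group(1), match.group(2)
print(page)  # 4
print(tags)  # example
```

Unlike .*?, a negated class cannot run past its terminator, so a group never spills into the next attribute if the rest of the pattern changes.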

Upvotes: 1

speakr

Reputation: 4209

If the page parameter always comes first, try the PCRE pattern /\?page=(\d+)/. Capture group 1 will contain the page number.
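The same pattern without the PCRE delimiters should also work in other regex flavours. A small sketch assuming Python's re module:

```python
import re

html = '<link href="/post?page=4&amp;tags=example" rel="last" title="Last Page">'

# \d+ only accepts digits, so the group ends on its own before "&amp;".
match = re.search(r'\?page=(\d+)', html)
page_number = int(match.group(1)) if match else None
print(page_number)  # 4
```

Note that this matches the first ?page= anywhere in the input, so it is only safe when the input is already narrowed down to the link in question.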

Upvotes: 0
