bateman

Reputation: 527

Regular expression that matches URLs ending in digits, but only when nothing follows the digits

I'm using Scrapy to scrape a website, and I'm stuck on defining the rule for extracting links. Specifically, I need a regular expression that allows URLs like:

https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108

while rejecting URLs like this one:

https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12

In other words, I want URLs that end with digits (i.e., /1352 in the example above), unless something else follows those digits (i.e., /12 in the example above).

I am by no means an expert in regular expressions, and I could only come up with something like \/(\d+)$, or even ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail to exclude the unwanted URLs, since both simply match the last run of digits in the address.
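To make the failure concrete, here is a minimal Python sketch that runs both attempts against one wanted and one unwanted URL. The variable and function names are only illustrative, and the slashes are left unescaped because Python's `re` does not require escaping them.

```python
import re

wanted = "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352"
unwanted = "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12"

# The two attempts from the question.
loose = re.compile(r"/(\d+)$")
anchored = re.compile(r"^https://discuss.dwolla.com/t/\S*/(\d+)$")


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


for name, pattern in (("loose", loose), ("anchored", anchored)):
    print(name, matches(pattern, wanted), matches(pattern, unwanted))

# In the anchored pattern, \S* is free to swallow ".../1352",
# so the capture group ends up holding the trailing "12".
print(anchored.search(unwanted).group(1))
```

Both patterns accept both URLs: nothing in either one constrains what the segment before the final digits may look like.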

--- UPDATE ---

Sorry for not being clear in the first place. To clarify: the digits at the end of the URLs can change, so /1352 is not fixed. For instance, another URL that should be accepted is:

https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108

Upvotes: 0

Views: 651

Answers (2)

Bohemian

Reputation: 424983

This is probably the simplest way:

[^\/\d][^\/]*\/\d+$

or to restrict to a particular domain:

^https?:\/\/discuss\.dwolla\.com\/.*[^\/\d][^\/]*\/\d+$


This regex requires the last path segment to be all digits, and the second-to-last segment to contain at least one non-digit.
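As a quick check, the sketch below runs the shorter pattern against the URLs from the question using Python's `re` module. The list names and the `is_topic_url` helper are made up for illustration; slashes are left unescaped since Python does not need them escaped.

```python
import re

# Last segment: digits only. Second-to-last segment: at least one non-digit.
TOPIC_URL = re.compile(r"[^/\d][^/]*/\d+$")

accepted = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
]
rejected = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]


def is_topic_url(url):
    """Return True if the URL ends in a slug segment followed by a numeric id."""
    return TOPIC_URL.search(url) is not None


for url in accepted + rejected:
    print(is_topic_url(url), url)
```

The unwanted URL is rejected because its second-to-last segment, 1352, contains only digits. The same reasoning means a topic whose slug is purely numeric would also be rejected. Scrapy's `LinkExtractor` takes regular expressions through its `allow` argument, so a pattern like this can presumably be passed there as-is.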

Upvotes: 2

Puneeth Reddy V

Reputation: 1568

Here is a regex, written as a Java string literal, that may fit your requirements. If you expect a fixed number of digits N, you can replace the final + with {N}.

^https://discuss\\.dwolla\\.com/t/[\\w-]+/\\d+$
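Since the question is about Scrapy, here is a hedged Python rendering of the same idea: single backslashes in a raw string, escaped dots, and the character class `[\w-]` for the slug. The `{4}` variant is just an example of fixing the digit count; the constant names are illustrative.

```python
import re

# Exactly one slug segment after /t/, then a numeric id, then end of string.
ANY_ID = re.compile(r"^https://discuss\.dwolla\.com/t/[\w-]+/\d+$")
# Same pattern, but the id must be exactly four digits.
FOUR_DIGIT_ID = re.compile(r"^https://discuss\.dwolla\.com/t/[\w-]+/\d{4}$")

urls = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]

for url in urls:
    print(bool(ANY_ID.match(url)), bool(FOUR_DIGIT_ID.match(url)), url)
```

Because `[\w-]+` cannot cross a slash, the pattern only accepts URLs with exactly one segment between /t/ and the numeric id, which is what excludes the /1352/12 form.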

Upvotes: 0
