Reputation: 527
I'm using Scrapy to scrape a website, and I'm stuck on properly defining the rule for extracting links. Specifically, I need help writing a regular expression that allows URLs like:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352
https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
while rejecting URLs like this one:
https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12
In other words, I want URLs that end with digits (i.e., /1352 in the example above), unless anything follows those digits (i.e., /12 in the example above).
I am by no means an expert in regular expressions, and I could only come up with something like \/(\d+)$, or even ^https:\/\/discuss.dwolla.com\/t\/\S*\/(\d+)$, but both fail to exclude the unwanted URLs, since both still match the final digits of the address.
--- UPDATE ---
Sorry for not being clear in the first place. To clarify: the digits at the end of the URLs can vary, so /1352 is not fixed. Another example of a URL that should be accepted is:
https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108
Upvotes: 0
Views: 651
Reputation: 424983
This is probably the simplest way:
[^\/\d][^\/]*\/\d+$
or to restrict to a particular domain:
^https?:\/\/discuss.dwolla.com\/.*[^\/\d][^\/]*\/\d+$
See live demo.
This regex requires the last part to be all digits, and the 2nd last part to have at least 1 non-digit.
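To check this behaviour against the URLs from the question, here is a minimal Python sketch using only the standard `re` module. The helper name `is_topic_url` and the `TOPIC_URL` constant are made up for illustration; the pattern is the first one above with the slash escapes dropped, since Python does not need them.

```python
import re

# First pattern from this answer: the last path segment is all digits,
# and the segment before it contains at least one non-digit character.
TOPIC_URL = re.compile(r"[^/\d][^/]*/\d+$")


def is_topic_url(url: str) -> bool:
    """Hypothetical helper: True if the URL ends in /<digits> and the
    preceding path segment is not purely numeric."""
    return TOPIC_URL.search(url) is not None


accepted = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352",
    "https://discuss.dwolla.com/t/enhancement-dwolla-php-updated-to-2-1-3/1180",
    "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108",
]
rejected = [
    "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352/12",
]

for url in accepted + rejected:
    print(is_topic_url(url), url)
```

Scrapy's `LinkExtractor` takes regular expressions through its `allow` argument and applies them to absolute URLs, so the same pattern string should be usable there. That is untested here and worth verifying against your own spider.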
Upvotes: 2
Reputation: 1568
Here is a regex, written as a Java string literal, that may fit your requirements. If you expect exactly N digits, you can replace the final + with {N}.
^https://discuss\\.dwolla\\.com/t/[\\w-]+/\\d+$
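Since the question is about Scrapy, here is a rough Python rendering of the same idea. It assumes the intent is one slug segment made of word characters and hyphens, followed by one all-digit segment; the dots are escaped and the character class is written as `[\w-]`. The function name `topic_pattern` and its `n_digits` parameter are hypothetical, added only to illustrate the {N} option.

```python
import re
from typing import Optional


def topic_pattern(n_digits: Optional[int] = None) -> re.Pattern:
    """Hypothetical helper: build the topic-URL pattern.

    With n_digits=None the ID may have any number of digits (\\d+);
    otherwise it must have exactly n_digits digits (\\d{N}).
    """
    quantifier = "+" if n_digits is None else "{%d}" % n_digits
    # The slug segment cannot contain "/", so a URL with an extra
    # trailing segment such as /1352/12 cannot match in full.
    return re.compile(r"https://discuss\.dwolla\.com/t/[\w-]+/\d" + quantifier)


any_length = topic_pattern()
four_digits = topic_pattern(4)

good = "https://discuss.dwolla.com/t/the-dwolla-reflector-is-now-open-source/1352"
short_id = "https://discuss.dwolla.com/t/updated-java-android-helper-library-for-dwollas-api/108"
bad = good + "/12"

# fullmatch anchors the pattern at both ends, like ^...$ in the Java version.
print(bool(any_length.fullmatch(good)))
print(bool(any_length.fullmatch(bad)))
print(bool(four_digits.fullmatch(short_id)))
```

Fixing the digit count is probably too strict for this site, because the accepted examples have both three-digit and four-digit IDs.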
Upvotes: 0