Reputation: 97
I have a string like this:
<dd>TF-AIDN, "Proposal for something...", Version 3.4, 18 November 2015 https://www.something.org/en/system/files/files/file-18nov15-en.pdf</dd>
How can I modify the following statement to extract the URL from such a string?
urlfinder = re.compile(r"((https?):((//)|(\\\\))+[\w\d:#@%/;$()~_?\+-=\\\.&]*)", re.MULTILINE|re.UNICODE)
I can't figure out how to modify the regular expression so that it treats < as the end of a URL instead of a space.
Upvotes: 1
Views: 378
Reputation: 31035
You can use this regex instead:
(http[^<]+)
This matches http followed by one or more characters that are not <, so the match stops just before the closing tag.
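As a rough sketch, here is how that pattern behaves on the sample string from the question. The names text, answer_pattern, strict_pattern and range_matches_lt are made up for illustration. The stricter variant is an assumption on my part, not part of the answer above: it requires :// and also stops at whitespace, which may or may not suit your data. The last check shows a likely reason the original pattern ran past <: inside a character class, \+-= is read as a range from + to =, and < falls inside that range.

```python
import re

# Sample input taken from the question.
text = (
    '<dd>TF-AIDN, "Proposal for something...", Version 3.4, 18 November 2015 '
    'https://www.something.org/en/system/files/files/file-18nov15-en.pdf</dd>'
)

# Pattern from the answer: "http", then one or more characters that are not "<".
answer_pattern = re.compile(r"(http[^<]+)")

# Stricter variant (an assumption, not from the answer): require "://" and
# also stop at whitespace so text following a URL is not captured.
strict_pattern = re.compile(r"https?://[^<\s]+")

answer_urls = answer_pattern.findall(text)
strict_urls = strict_pattern.findall(text)

# Inside a character class, "\+-=" is a range from "+" to "=", which contains "<".
range_matches_lt = re.fullmatch(r"[\+-=]", "<") is not None

print(answer_urls)
print(strict_urls)
print(range_matches_lt)
```

Both patterns return the same single URL for this input. If you want to keep your original character class, moving the hyphen to the end of the class (or escaping it) should stop it from being treated as a range.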
Upvotes: 2