Reputation: 5355
I have the following text:
  submitted by   <a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> <span><a href="https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">[link]</a></span>   <span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>
I would like to get all links that are NOT from reddit.com. For the text above, the result should be https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html.
I tried the following regex, which matches ALL URLs:
(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})
However, I only want the URLs that are NOT from reddit.com.
Any suggestions on how to approach this?
I appreciate your replies!
Upvotes: 1
Views: 47
Reputation:
The href of every 'a' tag whose link does not contain reddit.com can be matched with the regex below. The link is captured in group 2 (group 1 holds the opening quote character):
<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])((?:(?!\1|reddit\.com)[\S\s])+)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>
https://regex101.com/r/UxKB0a/1
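As a rough illustration, here is how the pattern above could be applied from Python. This is only a sketch: the question does not say which language or regex engine is in use, so the choice of Python's re module, the names A_HREF_NOT_REDDIT and non_reddit_links, and the slightly whitespace-trimmed SAMPLE string are all assumptions made for the example.

```python
import re

# The regex from the answer, unchanged.
# Group 1 = the quote character that opens the href value.
# Group 2 = the href value itself, which may not contain "reddit.com".
A_HREF_NOT_REDDIT = re.compile(
    r'''<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])((?:(?!\1|reddit\.com)[\S\s])+)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>'''
)

# The text from the question (runs of spaces shortened for readability).
SAMPLE = (
    'submitted by <a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> '
    '<span><a href="https://www.dailymail.co.uk/news/article-7646171/'
    'Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">'
    '[link]</a></span> '
    '<span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/'
    'jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>'
)


def non_reddit_links(html):
    """Return the href of every <a> tag whose href does not contain reddit.com."""
    return [match.group(2) for match in A_HREF_NOT_REDDIT.finditer(html)]


if __name__ == "__main__":
    for link in non_reddit_links(SAMPLE):
        print(link)
```

Note that the pattern rejects any href containing the text reddit.com anywhere in its value, not only in the host name, so a non-reddit URL with reddit.com in its path or query string would be skipped as well.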
Upvotes: 2