Carol.Kar

Reputation: 5355

Get link(s) that are NOT from 'example.com'

I have the following text:

&#32; submitted by &#32; <a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> <span><a href="https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">[link]</a></span> &#32; <span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>

I would like to get all links that are NOT from reddit.com. For the text above, the result should be the link https://www.dailymail.co.uk/news/article-7646171/Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html.

I tried the following, which matches ALL URLs:

(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})

However, I would like to match only the URLs that are NOT from reddit.com.

Any suggestions on how to approach this?

I appreciate your replies!

Upvotes: 1

Views: 47

Answers (1)

user12097764

The href values of all 'a' tags that do not contain reddit.com can be matched with the regex below.

The link is captured in group 2.

<a(?=\s)(?=(?:[^>"']|"[^"]*"|'[^']*')*?\shref\s*=\s*(?:(['"])((?:(?!\1|reddit\.com)[\S\s])+)\1))\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]*?)+>

https://regex101.com/r/UxKB0a/1
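
For comparison, here is a minimal sketch of the same filter without a regex, using only Python's standard library. It is an alternative technique, not the pattern above: it parses the 'a' tags with html.parser and compares each link's host name, so a URL that merely mentions reddit.com in its path or query string is still kept. The names LinkCollector and links_not_from are made up for this illustration, and treating subdomains such as old.reddit.com as "from reddit.com" is an assumption about what the question wants.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_not_from(html, blocked_domain):
    """Return hrefs whose host is neither blocked_domain nor a subdomain of it.

    Assumption: relative links (no host) are kept, since they cannot
    be attributed to the blocked domain.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for href in collector.hrefs:
        host = (urlparse(href).hostname or "").lower()
        if host == blocked_domain or host.endswith("." + blocked_domain):
            continue
        kept.append(href)
    return kept


# The sample text from the question
text = (
    '&#32; submitted by &#32; <a href="https://www.reddit.com/user/Leon91"> /u/Leon91 </a> <br/> '
    '<span><a href="https://www.dailymail.co.uk/news/article-7646171/'
    'Jared-Kushner-greenlit-arrest-Jamal-Khashoggi-phone-call-Saudi-Prince.html">[link]</a></span> &#32; '
    '<span><a href="https://www.reddit.com/r/worldnews/comments/drfnas/'
    'jared_kushner_greenlit_arrest_of_jamal_khashoggi/">[comments]</a></span>'
)

print(links_not_from(text, "reddit.com"))
```

On the sample text this should print a list containing only the dailymail.co.uk link. Comparing host names rather than searching the whole URL for a substring is a design choice: it avoids rejecting a link like https://example.org/?u=reddit.com, which the substring check in the regex would exclude.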

Upvotes: 2
