user15140929

Reputation: 33

Regex to remove links that match a pattern - Python

I am looking to remove the entire link:

https://www.linkedin.com/in/ACoAAAJv1l4BATlBOVqhEEaqrVNojJPWnID9Nk0

When the link contains ACo, the regex should remove the entire link from my results.

regex2 = re.compile(r"\bhttps?://www.linkedin.com/in/\b[^in]+")

For some reason I cannot get this to work. The idea is to remove a link when the part after /in/ starts with 'ACo' (capital A and capital C).

We have 4 links, and I only want to print https://www.linkedin.com/in/joao1 and https://www.linkedin.com/in/joao2.

import re

unique_hrefs = ['https://www.linkedin.com/in/joao1','https://www.linkedin.com/in/joao2','https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY','https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg']
    
regex = re.compile(r"\bhttps?://www.linkedin.com/in/\b[^in]+")

regex2 = re.compile(r"""\bhttps?://www\.linkedin\.com/in/ACo[^<>"'\s]*""")

filtered = [i for i in unique_hrefs if regex.search(i) and regex2.search(i)]

for i in filtered:
    print(i)

Upvotes: 2

Views: 77

Answers (1)

Ryszard Czech

Reputation: 18621

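Compile a pattern for the links you want to drop, then keep only the URLs for which pattern.match() returns no match: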

import re
unique_hrefs = ['https://www.linkedin.com/in/joao1','https://www.linkedin.com/in/joao2','https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY','https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg']
pattern = re.compile(r'https?://www\.linkedin\.com/in/ACo')
results = list(filter(lambda x: not pattern.match(x), unique_hrefs))
print(results)

See Python proof.

Results: ['https://www.linkedin.com/in/joao1', 'https://www.linkedin.com/in/joao2'].
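
As for why the original attempt did not work: [^in]+ is a character class that excludes the single characters i and n, not the string "in", and the list comprehension keeps the links that match both patterns, which is exactly the ACo group. As a rough alternative sketch, a negative lookahead can express "keep everything except ACo" in a single pattern, and the same ACo pattern can be used with sub() if the links sit inside a longer string. The names keep_pattern, remove_pattern, kept, and text below are illustrative assumptions, not part of the original code:

```python
import re

unique_hrefs = [
    "https://www.linkedin.com/in/joao1",
    "https://www.linkedin.com/in/joao2",
    "https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY",
    "https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg",
]

# (?!ACo) is a negative lookahead: the match fails when "ACo"
# comes immediately after /in/, so only the other profiles match.
keep_pattern = re.compile(r"https?://www\.linkedin\.com/in/(?!ACo)")

kept = [href for href in unique_hrefs if keep_pattern.match(href)]
for href in kept:
    print(href)

# Hypothetical case: the links are embedded in free text.
# The character class stops at whitespace, quotes, or angle brackets,
# so sub() deletes the whole ACo link and leaves the rest untouched.
remove_pattern = re.compile(r"""https?://www\.linkedin\.com/in/ACo[^<>"'\s]*""")
text = (
    "see https://www.linkedin.com/in/ACoAAAJv1l4BATlBOVqhEEaqrVNojJPWnID9Nk0 "
    "and https://www.linkedin.com/in/joao1"
)
cleaned = remove_pattern.sub("", text)
print(cleaned)
```

Note that the match is case-sensitive, so a profile slug beginning with lowercase "aco" would still be kept, which seems to be what the question asks for.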

Upvotes: 1
