Reputation: 33
I want to filter out entire links such as:
https://www.linkedin.com/in/ACoAAAJv1l4BATlBOVqhEEaqrVNojJPWnID9Nk0
Whenever the part of the link after /in/ starts with ACo, the whole link should be removed from my results. This is the pattern I tried:
regex2 = re.compile(r"\bhttps?://www.linkedin.com/in/\b[^in]+")
For some reason this does not work. The idea is to drop a link when the part after /in/ starts with 'ACo' (capital A and capital C).
There are 4 links, and I only want to print https://www.linkedin.com/in/joao1 and https://www.linkedin.com/in/joao2.
unique_hrefs = ['https://www.linkedin.com/in/joao1','https://www.linkedin.com/in/joao2','https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY','https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg']
regex = re.compile(r"\bhttps?://www.linkedin.com/in/\b[^in]+")
regex2 = re.compile(r"""\bhttps?://www\.linkedin\.com/in/ACo[^<>"'\s]*""")
filtered = [i for i in unique_hrefs if regex.search(i) and regex2.search(i)]
for i in filtered:
    print(i)
Upvotes: 2
Views: 77
Reputation: 18621
Use
import re
unique_hrefs = ['https://www.linkedin.com/in/joao1','https://www.linkedin.com/in/joao2','https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY','https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg']
pattern = re.compile(r'https?://www\.linkedin\.com/in/ACo')
results = list(filter(lambda x: not pattern.match(x), unique_hrefs))
print(results)
See Python proof.
Results: ['https://www.linkedin.com/in/joao1', 'https://www.linkedin.com/in/joao2']
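As for why the attempt in the question fails, two things seem to be going on. First, `[^in]+` is a character class that matches any characters other than "i" and "n", so it does not express "not starting with ACo". Second, the comprehension keeps a link only when `regex2.search(i)` is truthy, which selects the ACo links instead of excluding them. A minimal sketch of an alternative, assuming every entry is a plain profile URL as in the question, is to put a negative lookahead right after /in/ (the names `char_class` and `keep` are only illustrative):

```python
import re

unique_hrefs = [
    'https://www.linkedin.com/in/joao1',
    'https://www.linkedin.com/in/joao2',
    'https://www.linkedin.com/in/ACoAAAI3JyABlHv1LxXa27GHFneEbdrqAtMu9eY',
    'https://www.linkedin.com/in/ACoAABWYG0kB8IXhFzDTCFGOwAZ18YbXprOLcmg',
]

# [^in]+ matches a run of characters that are neither "i" nor "n";
# it has nothing to do with the literal text "in" or "ACo".
char_class = re.compile(r"[^in]+")
print(char_class.match("linkedin").group())  # l  (stops at the first "i")

# (?!ACo) is a negative lookahead: the match succeeds only when the text
# right after /in/ does not begin with "ACo".
keep = re.compile(r"https?://www\.linkedin\.com/in/(?!ACo)")
filtered = [href for href in unique_hrefs if keep.match(href)]

for href in filtered:
    print(href)
```

This keeps the two joao links and drops both ACo links. The smallest change to the question's own code would be to write `not regex2.search(i)` in the comprehension.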
Upvotes: 1