Reputation: 861
I have a list of links that I scraped from a website. I want to remove the relative links that point to pages of the same site, for example '/about/'. There are a number of them. Rather than writing separate loops that each remove one of them from the list, is there a way to check each entry and keep it only if it starts with "http" (not just "https" as in the data below, because the "s" might not be there)? My list data is this:
['mailto:[email protected]', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:[email protected]', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']
Upvotes: 1
Views: 59
Reputation: 9375
I would go with a list comprehension and startswith():
full_links = [link for link in links if link.startswith('http://') or link.startswith('https://')]
I think this is clearer than regex when you have such a simple task. Also, IMO you should ask for http:// and https:// explicitly, because only using http might give you false positives if you meet relative links like http_stuff/foo.html.
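As a small sketch of the point above: str.startswith also accepts a tuple of prefixes, so the two checks can be combined into one call. The sample list is a shortened version of the data in the question, and 'http_stuff/foo.html' is a made-up entry added only to show the false positive.

```python
# Shortened sample of the question's data, plus one hypothetical relative
# link ('http_stuff/foo.html') that happens to begin with "http".
links = [
    'mailto:[email protected]',
    'https://www.demodms.com/annuity/',
    '/about/',
    'http_stuff/foo.html',
    'https://youtechassociates.com/',
]

# startswith returns True if the string starts with any prefix in the tuple.
full_links = [link for link in links if link.startswith(('http://', 'https://'))]

# The looser check also keeps the relative link that merely begins with "http".
loose_links = [link for link in links if link.startswith('http')]

print(full_links)
print(loose_links)
```

The first list contains only the two absolute URLs, while the second also lets 'http_stuff/foo.html' through.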
Upvotes: 0
Reputation: 2515
Here is a simple way to do this:
mlist = your_list  # placeholder: the list from the question
newlist = []
for m in mlist:
    # keep only the entries that start with 'http' (covers http and https)
    if m.startswith('http'):
        newlist.append(m)
Upvotes: 1
Reputation: 5713
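You can use filter to get this result. Note that, as written, the predicate keeps the entries that do not contain 'http' (the mailto and relative links); use 'http' in x instead if you want to keep the full URLs.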
a = ['mailto:[email protected]', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', '/events/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'mailto:[email protected]', '/about/', '/events/', '/news/', '/contact/', 'https://youtechassociates.com/', '/privacy-policy', '/terms-of-use', '/disclosure/']
b = filter(lambda x: 'http' not in x, a)
print(list(b))
Output:
['mailto:[email protected]', '/events/', 'mailto:[email protected]', '/about/', '/events/', '/news/', '/contact/', '/privacy-policy', '/terms-of-use', '/disclosure/']
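If the goal is to keep the full URLs, as the question asks, the same filter approach works with the opposite predicate. The following is a minimal sketch on a shortened sample of the list; the helper name is_full_url is made up for illustration.

```python
# Shortened sample of the list from the question.
a = [
    'mailto:[email protected]',
    'https://www.demodms.com/annuity/',
    '/events/',
    'https://youtechassociates.com/',
    '/privacy-policy',
]

def is_full_url(link):
    """Hypothetical helper: True for links that start with http:// or https://."""
    return link.startswith(('http://', 'https://'))

# filter keeps the items for which the predicate returns True.
kept = list(filter(is_full_url, a))

# The complement: everything the predicate rejects.
dropped = [x for x in a if not is_full_url(x)]

print(kept)
print(dropped)
```

Using a named predicate instead of a lambda makes it easy to get both the kept and the dropped entries from the same test.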
Upvotes: 1
Reputation: 20414
You can use a list-comprehension with a regex to filter out links that do not contain the protocol:
import re
[link for link in links if re.match(r'https?://', link)]
giving:
['https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/', 'https://www.demodms.com/annuity/about/', 'https://www.demodms.com/annuity/services/', 'https://www.demodms.com/annuity/educational-courses/', 'https://www.demodms.com/annuity/events/', 'https://www.demodms.com/annuity/articles-and-downloads/', 'https://www.demodms.com/annuity/videos/', 'https://www.demodms.com/annuity/calculators/', 'https://www.demodms.com/annuity/news/', 'https://www.demodms.com/annuity/contact/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/tips-for-back-to-school-season/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-things-to-know-about-getting-life-insurance-for-your-child/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/5-signs-you-need-to-up-your-life-insurance-coverage/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://www.demodms.com/annuity/tips-for-summer-travel/', 'https://youtechassociates.com/']
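A different technique from the regex above, shown here only as a sketch: instead of matching the prefix by hand, the scheme can be read with urllib.parse.urlsplit from the standard library, which returns the scheme in lowercase. The sample list is a shortened version of the question's data, and the uppercase 'HTTP://EXAMPLE.COM/' and 'http_stuff/foo.html' entries are made up for illustration.

```python
from urllib.parse import urlsplit

# Shortened sample of the question's data plus two hypothetical entries.
links = [
    'mailto:[email protected]',
    'https://www.demodms.com/annuity/',
    '/about/',
    'HTTP://EXAMPLE.COM/',   # made-up: scheme written in uppercase
    'http_stuff/foo.html',   # made-up: relative link, no scheme at all
]

# urlsplit(...).scheme is '' for relative links and 'mailto' for mail links,
# so only entries whose scheme is http or https are kept.
full_links = [link for link in links if urlsplit(link).scheme in ('http', 'https')]

print(full_links)
```

This keeps the two absolute URLs regardless of the case of the scheme and drops the mailto and relative entries.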
Upvotes: 1