How to extract links from a specific set of domains?

Question

I want to extract links from a webpage, but only links that point to three specific domains. How can I do this using BeautifulSoup?

I have the following code, which works fine for extracting all links from a single domain:

for link in soup.select("a[href^='http://ABCD.tv/']"):
    print(link.get('href'))

But I want to add two more domains, such as https://AABCD.tv and http://FFGV.VV.

I tried the | operator but it does not work:

for link in soup.select("a[href^='http://ABCD.tv/'|'https://AABCD.tv'|'http://FFGV.VV']"):

Any help will be appreciated!

javidcf · Accepted Answer

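The | character does not act as an "or" inside a CSS attribute selector. To match any of several alternatives, write one selector per domain and separate them with commas. I think what you need is:

for link in soup.select("a[href^='http://ABCD.tv/'],a[href^='https://AABCD.tv'],a[href^='http://FFGV.VV']"):
    print(link.get('href'))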

Or if you have a long list of URL bases you could do:

url_bases = ['http://ABCD.tv/', 'https://AABCD.tv', 'http://FFGV.VV']
for link in soup.select(','.join(f"a[href^='{base}']" for base in url_bases)):
    # ...

(Replace f"a[href^='{base}']" with "a[href^='{}']".format(base) if you are using Python 3.5 or earlier, since f-strings require Python 3.6+.)
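For reference, here is a minimal sketch of the list-based approach wrapped into two small functions. The names build_selector and links_matching are made up for illustration and are not part of BeautifulSoup; the sketch also assumes the bs4 package is installed and uses the built-in "html.parser" backend.

```python
def build_selector(url_bases):
    """Build a comma-separated selector group, one a[href^='...'] per base.

    Hypothetical helper. It assumes the bases contain no single quotes,
    since those would break the quoted attribute value.
    """
    return ",".join(f"a[href^='{base}']" for base in url_bases)


def links_matching(html, url_bases):
    """Return the href of every <a> whose URL starts with one of url_bases."""
    # Imported here so build_selector can be used without bs4 installed.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    # A comma in a CSS selector means "match any of these selectors".
    return [link.get("href") for link in soup.select(build_selector(url_bases))]
```

Calling links_matching(html, ['http://ABCD.tv/', 'https://AABCD.tv', 'http://FFGV.VV']) should then give a list of the matching href values. Note that ^= is a plain string-prefix match, so a base without a trailing slash such as 'https://AABCD.tv' would also match a URL like 'https://AABCD.tv.example.com/'; adding the trailing slash to each base avoids that.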
