csstudent2x

Reputation: 69

Reading in Content From URLs in a File

I'm trying to collect the URLs that are linked from a main URL. However, when I print the result to check the content, I only get the raw HTML, not the URLs within it.

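    import urllib.request  # a plain "import urllib" does not reliably load the request submodule

    file = 'http://example.com'

    # Download the page and decode the raw bytes into a string of HTML
    with urllib.request.urlopen(file) as url:
        collection = url.read().decode('UTF-8')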

Upvotes: 0

Views: 77

Answers (1)

Upasana Mittal

Reputation: 2680

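I think this is what you are looking for. urlopen only downloads the page; to get the URLs you still have to parse the HTML and read the href of each link. You can do that with the Beautiful Soup library (the bs4 package), and this code should work with Python 3: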

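    from urllib.parse import urljoin
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    def get_all_urls(url):
        # Fetch the page and parse its HTML
        with urlopen(url) as response:
            soup = BeautifulSoup(response, 'html.parser')
        # href=True skips <a> tags that have no href attribute
        for link in soup.find_all('a', href=True):
            # urljoin keeps absolute links as they are and
            # resolves relative ones against the page URL
            print(urljoin(url, link['href']))

    # urlopen needs a full URL, including the scheme
    get_all_urls('http://example.com')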
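
If you would rather get the links back as a list than print them, or you can't install Beautiful Soup, here is a rough sketch of the same idea using only the standard library. The names LinkCollector and extract_links are made up for this example, and it assumes you already have the page as a decoded string (like collection in the question).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, as absolute URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

With the variables from the question, this would be called as extract_links(collection, file), since collection holds the decoded HTML and file holds the page URL.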

Upvotes: 1
