Thomas

Reputation: 49

How to add "https://www.example.com/" before scraped URLs in Python that don't already have it

I'm a rookie using Python and I'm trying to scrape a list of URLs from a website and send them to a .csv file, but I keep getting a bunch of URLs that are only partial: they don't have "https://www.example.com" before the rest of the URL. I've found that I need to add something like "['https://www.example.com{0}'.format(link) if link.startswith('/') else link for link in url_list]" into my code, but where am I supposed to add it? And is that even what I should add? Thanks for any help! Here is my code:

url_list=soup.find_all('a')
with open('HTMLList.csv','w',newline="") as f:
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    for link in url_list:
        url=link.get('href')
        if url:
            writer.writerow([url])
f.close()

If you notice anything else that should be changed please let me know. Thank you!

Upvotes: 0

Views: 179

Answers (1)

Franco

Reputation: 2926

A simple if statement will achieve this. Inside the loop, check whether https://www.example.com is already in the URL and, if it isn't, concatenate it onto the front.

import csv

url_list = soup.find_all('a')
with open('HTMLList.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=' ', lineterminator='\r')
    for link in url_list:
        url = link.get('href')
        # updated: skip missing hrefs and bare '#' anchors
        if url != '#' and url is not None:
            # added: prefix the site root when the URL lacks it
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
# no f.close() needed: the with block closes the file on exit
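
If some of the scraped links point to other sites, or are relative without a leading slash, plain concatenation can produce broken URLs. One possible alternative is `urljoin` from the standard library's `urllib.parse`, which leaves absolute URLs alone and resolves relative ones against a base. This is only a sketch: the `absolutize` helper, the `BASE_URL` constant, and the sample hrefs are made up for illustration, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = "https://www.example.com/"  # assumed site root, taken from the question


def absolutize(hrefs, base=BASE_URL):
    """Return absolute URLs, skipping missing hrefs and bare '#' anchors."""
    result = []
    for href in hrefs:
        if not href or href == "#":
            continue
        # urljoin keeps absolute URLs unchanged and resolves relative ones
        result.append(urljoin(base, href))
    return result


# Hypothetical hrefs standing in for the link.get('href') values
sample = [
    "/about",
    "https://www.example.com/contact",
    None,
    "#",
    "https://other.example.org/page",
]
absolute = absolutize(sample)

# Write one URL per row; the buffer stands in for
# open('HTMLList.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
for url in absolute:
    writer.writerow([url])
```

In the real script you would build the list with something like `[link.get('href') for link in url_list]` and pass the opened file to `csv.writer` instead of the buffer.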

Upvotes: 1
