Reputation: 49
I'm a rookie using Python and I'm trying to scrape a list of URLs from a website and write them to a .CSV file, but I keep getting a bunch of URLs that are only partial: they don't have "https://www.example.com" before the rest of the URL. I've found that I need to add something like "['https://www.example.com{0}'.format(link) if link.startswith('/') else link for link in url_list]" to my code, but where am I supposed to add it? And is that even what I should add? Thanks for any help! Here is my code:
url_list=soup.find_all('a')
with open('HTMLList.csv','w',newline="") as f:
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    for link in url_list:
        url=link.get('href')
        if url:
            writer.writerow([url])
f.close()
If you notice anything else that should be changed please let me know. Thank you!
Upvotes: 0
Views: 179
Reputation: 2926
A simple if statement will achieve this. Just check whether https://www.example.com is already in the URL and, if it isn't, prepend it.
url_list=soup.find_all('a')
with open('HTMLList.csv','w',newline="") as f:
    writer=csv.writer(f,delimiter=' ',lineterminator='\r')
    for link in url_list:
        url=link.get('href')
        # updated: skip links with no href and bare '#' anchors
        if url != '#' and url is not None:
            # added: prepend the site root when the URL does not contain it
            if 'https://www.example.com' not in url:
                url = 'https://www.example.com' + url
            writer.writerow([url])
f.close()
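One caveat with the substring check above: an absolute link to another site (say https://other.example.org/page) does not contain https://www.example.com either, so it would get the prefix glued onto the front. A possible alternative, offered only as a sketch, is to let the standard library's urllib.parse.urljoin resolve each href against the site root. BASE_URL, absolutize, and the sample hrefs below are hypothetical names and data made up for illustration; the href strings stand in for what link.get('href') would return.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the question; change it to the real site.
BASE_URL = 'https://www.example.com'


def absolutize(hrefs, base=BASE_URL):
    """Return absolute URLs, skipping missing hrefs and bare '#' anchors."""
    result = []
    for href in hrefs:
        if not href or href == '#':
            continue
        # urljoin leaves absolute URLs untouched and resolves relative
        # ones (such as '/about') against the base.
        result.append(urljoin(base, href))
    return result


# Sample values standing in for link.get('href') results.
hrefs = [
    '/about',
    'https://www.example.com/contact',
    'https://other.example.org/page',
    '#',
    None,
]
absolute_urls = absolutize(hrefs)
print(absolute_urls)
```

In the loop from the answer, the same idea would amount to replacing the inner if with url = urljoin('https://www.example.com', url) before the writer.writerow([url]) call.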
Upvotes: 1