Reputation: 2657
I am performing web scraping via Python \ Selenium \ Chrome headless driver, which involves executing a loop:
# perform loop
CustId=2000
while (CustId<=3000):
    # Part 1: Customer REST call:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    dict_from_json = json.loads(soup.find("body").text)
    #logic for webscraping is here......
    CustId = CustId+1
# close driver at end of everything
driver.close()
However, sometimes the page does not exist for certain customer IDs. I have no control over this, and the code stops with a 404 page-not-found error. How do I ignore this and just move on with the loop?
I'm guessing I need a try...except?
Upvotes: 0
Views: 1778
Reputation: 193088
An ideal approach would be to use the range() function and call driver.quit() at the end, as follows:
for CustId in range(2000, 3001):  # upper bound 3001 so that ID 3000 is included, as in the question
    try:
        urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
        driver.get(urlg)
        if "404" not in driver.page_source:
            soup = BeautifulSoup(driver.page_source,"lxml")
            dict_from_json = json.loads(soup.find("body").text)
            #logic for webscraping is here......
    except Exception:
        # skip IDs whose page cannot be loaded or parsed
        continue
driver.quit()
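A note on where the exception probably comes from: as far as I know, driver.get() does not raise on an HTTP 404, since Chrome just renders the error page. The failure in the question is therefore most likely json.loads() choking on a body that is not JSON. Under that assumption, a narrower alternative to a broad except is to catch json.JSONDecodeError only. The helper name parse_customer_body and the sample payloads below are made up for illustration:

```python
import json
from typing import Optional


def parse_customer_body(body_text: str) -> Optional[dict]:
    """Return the decoded JSON object, or None if the body is not a JSON object.

    Hypothetical helper: the name and the return convention are assumptions.
    """
    try:
        data = json.loads(body_text)
    except json.JSONDecodeError:
        # The body was not JSON at all, e.g. an HTML or plain-text error page.
        return None
    # A bare "404" is valid JSON (a number), so also require a dict.
    return data if isinstance(data, dict) else None


print(parse_customer_body('{"id": 2000, "name": "Example"}'))
print(parse_customer_body("Page not found"))
```

Inside the loop you would call it with soup.find("body").text and skip the ID when it returns None.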
Upvotes: 0
Reputation: 33384
You can check the text of the h1 tag that appears in the page body when a 404 error occurs, and then use it in an if clause so that the parsing block only runs when that text is absent.
CustId=2000
while (CustId<=3000):
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    if not "Page not found" in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        #logic for webscraping is here......
    CustId=CustId+1
Or
CustId=2000
while (CustId<=3000):
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    if not "404" in soup.find("body").text:
        dict_from_json = json.loads(soup.find("body").text)
        #logic for webscraping is here......
    CustId=CustId+1
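One caveat with a plain substring check: a valid response can itself contain "404", for example the record for customer ID 2404, and would then be skipped by mistake. A rough sketch of a stricter check is below. It assumes the real responses are JSON objects or arrays and that the error page contains the text "Page not found". Both are guesses about the site, and the function names are made up:

```python
def is_error_page_naive(body_text: str) -> bool:
    # Flags any body containing "404", including valid data.
    return "404" in body_text


def is_error_page(body_text: str, marker: str = "Page not found") -> bool:
    # Only treat the body as an error page if it does not look like JSON
    # and it contains the (assumed) error marker.
    looks_like_json = body_text.lstrip().startswith(("{", "["))
    return (not looks_like_json) and marker in body_text


valid_body = '{"id": 2404, "name": "Example"}'  # made-up payload
error_body = "Page not found"                    # made-up error text

print(is_error_page_naive(valid_body))  # True (false positive)
print(is_error_page(valid_body))        # False
print(is_error_page(error_body))        # True
```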
Upvotes: 1
Reputation: 13
Maybe a way to do this would be to try:
try:
    urlg = f'https://mywebsite.com/customerRest/show/?id={CustId}'
    driver.get(urlg)
    soup = BeautifulSoup(driver.page_source,"lxml")
    dict_from_json = json.loads(soup.find("body").text)
    #logic for webscraping is here......
    CustId = CustId+1
except:
    print("404 error found, moving on")
    CustId = CustId+1
Sorry if this doesn't work, I haven't tested it out.
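For what it's worth, here is a small sketch of the same try/except idea written as a for loop, so the counter does not have to be incremented in two places. To keep it runnable without a browser, the page fetch is passed in as a function. scrape_customers and fake_fetch are hypothetical names, and the fake responses are invented:

```python
import json
from typing import Callable, Dict, List, Tuple


def scrape_customers(
    fetch_body: Callable[[int], str], first_id: int, last_id: int
) -> Tuple[Dict[int, dict], List[int]]:
    """Collect decoded JSON per customer ID and a list of IDs that were skipped."""
    results: Dict[int, dict] = {}
    skipped: List[int] = []
    for cust_id in range(first_id, last_id + 1):  # last_id is included
        try:
            results[cust_id] = json.loads(fetch_body(cust_id))
        except json.JSONDecodeError:
            # Body was not JSON (probably an error page), so move on.
            print(f"No usable data for customer {cust_id}, moving on")
            skipped.append(cust_id)
    return results, skipped


def fake_fetch(cust_id: int) -> str:
    # Stand-in for driver.get(...) plus reading the body text.
    if cust_id == 2001:
        return "Page not found"
    return json.dumps({"id": cust_id})


results, skipped = scrape_customers(fake_fetch, 2000, 2002)
print(sorted(results))  # [2000, 2002]
print(skipped)          # [2001]
```

With Selenium, fetch_body would be a function that calls driver.get() on the URL and returns BeautifulSoup(driver.page_source, "lxml").find("body").text.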
Upvotes: 0