Reputation: 73
I've written some web-scraping code that works, but is quite slow. Some background: I am using Selenium, as the site requires several stages of clicks and entry, along with BeautifulSoup. My code looks at a list of materials within subcategories on a website (image below) and scrapes them. If a material scraped from the website is one of the 30 I am interested in (lst below), it writes the number 1 to a dataframe, which I later convert to an Excel sheet.
The reason it is so slow, I believe, is that a lot of exceptions are raised, and I am not sure how to handle them other than with try/except. The main bits of code are below, as the entire piece is quite lengthy. I have also attached an image of the website in question for reference.
lst = ["Household cleaner and detergent bottles", "Plastic milk bottles", "Toiletries and shampoo bottles", "Plastic drinks bottles",
"Drinks cans", "Food tins", "Metal lids from glass jars", "Aerosols",
"Food pots and tubs", "Margarine tubs", "Plastic trays","Yoghurt pots", "Carrier bags",
"Aluminium foil", "Foil trays",
"Cardboard sleeves", "Cardboard egg boxes", "Cardboard fruit and veg punnets", "Cereal boxes", "Corrugated cardboard", "Toilet roll tubes", "Food and drink cartons",
"Newspapers", "Window envelopes", "Magazines", "Junk mail", "Brown envelopes", "Shredded paper", "Yellow Pages" , "Telephone directories",
"Glass bottles and jars"]
def site_scraper(site):
    page_loc = '//*[@id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div'.format(site)
    page = driver.find_element_by_xpath(page_loc)
    page.click()
    driver.execute_script("arguments[0].scrollIntoView(true);", page)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for i in x:
        for j in y:
            try:
                material = soup.find_all("div", class_="rlw-accordion-content")[i].find_all('li')[j].get_text(strip=True).encode('utf-8')
                if material in lst:
                    df.at[code_no, material] = 1
                else:
                    continue
                continue
            except IndexError:
                continue
x = xrange(0, 8)
y = xrange(0, 9)
p = xrange(1, 31)

for site in p:
    site_scraper(site)
Specifically, the i's and j's rarely go to 6, 7 or 8, but when they do, it is important that I capture that information too. For context, the i's correspond to the different categories in the image below (Automotive, Building materials, etc.), whilst the j's represent the sub-lists (car batteries, engine oil, etc.). Because these two loops are repeated for all 30 sites for each code, and I have 1500 codes, this is extremely slow: currently it takes 6.5 minutes for 10 codes.
Is there a way I could improve this process? I tried a list comprehension, but it was difficult to handle the errors and my results were no longer accurate. Would an if statement be a better choice here, and if so, how would I incorporate it? I would also be happy to attach the full code if required. Thank you!
EDIT: by changing

except IndexError:
    continue

to

except IndexError:
    break

it is now running almost twice as fast! Obviously it is best to exit the loop after the first failure, as the later iterations will also fail. However, any other pythonic tips are still welcome :)
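For illustration, here is a sketch of how the nested index loops could iterate over the parsed elements directly instead, so the IndexError never arises in the first place (same logic as above, untested):

for section in soup.find_all("div", class_="rlw-accordion-content"):
    for li in section.find_all("li"):
        material = li.get_text(strip=True).encode('utf-8')
        if material in lst:
            df.at[code_no, material] = 1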
Upvotes: 1
Views: 70
Reputation: 54984
It sounds like you just need the text of those lis:

lis = driver.execute_script("return [...document.querySelectorAll('.rlw-accordion-content li')].map(li => li.innerText.trim())")
This grabs every li's text in a single driver round-trip, so the nested index loops (and their IndexErrors) disappear entirely. Now you can use those for your logic:
for material in lis:
    if material in lst:
        df.at[code_no, material] = 1
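As a side note (an optional micro-optimisation, not specific to this site): membership tests against a set are O(1) rather than O(n) for a list, so you could build the set once outside the loop:

wanted = set(lst)  # set lookup is O(1); a list lookup scans all 30 entries
for material in lis:
    if material in wanted:
        df.at[code_no, material] = 1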
Upvotes: 1