Reputation: 23
My "for" loop is scraping from multiple pages (in this case, three that I put into a list), but the print/CSV output is not picking up the previous iterations through the loop; it's only giving me the results of the final, third page. I think the term I'm looking for here is "array", as I'd like each page's results to append vertically beneath the previous ones. I seem to be misinterpreting how this line works:
results.append(details)
This is all thanks to QHarr's excellent answer found here: How Can I Export Scraped Data to Excel Horizontally?
Here is the full, working code I am using:
import requests, re
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
examplelist = [['1'], ['2'], ['3']]
pages = [page for sublist in examplelist for page in sublist]

for key in pages:
    driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver_win32\chromedriver.exe")
    driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=' + str(key) + '&&searchradius=50&loc=10021')
    time.sleep(10)
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".restaurants")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    restaurants = soup.select('.restaurants')
    results = []
    for restaurant in restaurants:
        details = [re.sub(r'\s{2,}|[,]', '', i) for i in restaurant.select_one('h3 + p').text.strip().split('\n') if i != '']
        details.insert(0, restaurant.select_one('h3 a').text)
        results.append(details)
    #print(results)
    df = pd.DataFrame(results, columns=['Name', 'Address', 'City', 'State', 'Zip', 'Phone', 'AdditionalInfo'])
    df.to_csv(r'C:\Users\User\Documents\Restaurants.csv', sep=',', encoding='utf-8-sig', index=False)
    driver.close()
Thanks
Upvotes: 0
Views: 339
Reputation: 5151
I think you keep emptying results with results = [] inside the loop, and so you lose what you've already put in there. Initialize it outside the loop, like so:
results = []
for key in pages:
    driver = webdriver.Chrome(executable_path=r"C:\Users\User\Downloads\chromedriver_win32\chromedriver.exe")
    driver.get('https://www.restaurant.com/listing?&&st=KS&p=KS&p=PA&page=' + str(key) + '&&searchradius=50&loc=10021')
    time.sleep(10)
    WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".restaurants")))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    restaurants = soup.select('.restaurants')
    for restaurant in restaurants:
        details = [re.sub(r'\s{2,}|[,]', '', i) for i in restaurant.select_one('h3 + p').text.strip().split('\n') if i != '']
        details.insert(0, restaurant.select_one('h3 a').text)
        results.append(details)
    #print(results)
and remove that initialization from inside the loop.
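To see the pattern in isolation, here is a minimal runnable sketch, with a hypothetical dictionary of dummy rows standing in for what each scraped page would yield (the Selenium part needs a live site): the list is created once before the loop, every page appends to it, and the DataFrame is built once after the loop finishes.

```python
import pandas as pd

# Hypothetical rows standing in for the scraped results of each page.
fake_pages = {
    '1': [['Cafe A', '1 Main St'], ['Diner B', '2 Oak Ave']],
    '2': [['Grill C', '3 Elm Rd']],
    '3': [['Bistro D', '4 Pine Ln']],
}

results = []                      # created once, before the loop
for key in fake_pages:
    for details in fake_pages[key]:
        results.append(details)   # rows from every page accumulate here

# Build the DataFrame once, after all pages have been processed
df = pd.DataFrame(results, columns=['Name', 'Address'])
print(len(df))  # 4 -- rows from all three pages, stacked vertically
```

If results = [] were inside the outer loop instead, the same code would end with only page '3''s single row.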
Upvotes: 1