Reputation: 79
I am working on a script to scrape job postings on Indeed that captures the title, company, location and job description. Currently the script iterates through the first five pages and prints a dataframe for each. However, the dataframe for page 2 only includes 3 of the 15 job postings. I think this might be due to the pop-up box that asks for your email. To address this, I tried adding a .click() to close the pop-up. Unfortunately, this resulted in a "Timeout Exception". I added element = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, "popover-x-button-close icl-CloseButton"))) hoping that it would fix the issue, but no dice so far.
Additionally, when I export to CSV, the only page of results that ends up in the CSV is page 5. I've included my code below. My apologies if these are very straightforward problems; I only started learning Python three days ago in order to do job code research. Thank you in advance!
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")
PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start='+str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []
    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

    element = WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CLASS_NAME, "popover-x-button-close icl-CloseButton")))
    close_popup = driver.find_element_by_class_name("popover-x-button-close icl-CloseButton")
    close_popup.click()

    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
Upvotes: 3
Views: 350
Reputation: 33361
There are several issues here:
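1. You cannot locate this element with by_class_name (or By.CLASS_NAME) and the value "popover-x-button-close icl-CloseButton", since this method accepts a single class name, not a sequence of class names separated by spaces. Use a CSS selector such as button.popover-x-button-close.icl-CloseButton instead.
2. You can call the click() method directly on the element returned by WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))); there is no need to get this element again with driver.find_element.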
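To illustrate the class-name point: a class attribute containing spaces holds several class names, and the matching CSS selector joins them with dots. The sketch below is only an illustration; classes_to_css is a hypothetical helper, not part of Selenium.

```python
def classes_to_css(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute value into a compound CSS selector.

    Hypothetical helper for illustration only; it is not a Selenium function.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    # Each class name gets a leading dot; an optional tag name goes in front.
    return tag + "".join("." + name for name in names)


selector = classes_to_css("popover-x-button-close icl-CloseButton", tag="button")
print(selector)  # button.popover-x-button-close.icl-CloseButton
```

The resulting string is what goes into the locator tuple, as in (By.CSS_SELECTOR, selector).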
I suggest something like the following:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")
PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0,50,10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start='+str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []
    jobs = driver.find_elements_by_class_name("slider_container")

    for job in jobs:
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

    # Close the pop-up if it becomes visible within 5 seconds.
    # If it never shows up on this page, the wait times out and we simply move on.
    try:
        WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
    except TimeoutException:
        pass

    df_da=pd.DataFrame()
    df_da['JobTitle']=jobtitles
    df_da['Company']=companies
    df_da['Location']=locations
    df_da['Description']=descriptions
    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
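As for the CSV only containing page 5: the result lists are re-created for every page, so whatever is written last holds only the final page. One possible approach is to collect the rows of all pages in a single list created before the loop and write the file once at the end. The sketch below shows just that pattern, using the standard csv module so it runs without a browser; scrape_page is a hypothetical stand-in for the Selenium code and returns made-up rows.

```python
import csv
import io

FIELDS = ["JobTitle", "Company", "Location", "Description"]


def scrape_page(start):
    """Hypothetical stand-in for scraping one results page; returns fake rows."""
    return [
        {
            "JobTitle": f"Title {start + n}",
            "Company": f"Company {start + n}",
            "Location": "United States",
            "Description": f"Snippet {start + n}",
        }
        for n in range(3)
    ]


def collect_rows(starts):
    """Gather the rows of every page into one list created before the loop."""
    all_rows = []
    for start in starts:
        all_rows.extend(scrape_page(start))
    return all_rows


def write_csv(rows, handle):
    """Write the combined rows once, after all pages have been scraped."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = collect_rows(range(0, 50, 10))
buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_csv(rows, buffer)
print(len(rows))  # 15 rows: 5 pages x 3 fake postings each
```

With pandas the same idea would be to build the dataframe from the combined list and call pd.DataFrame(all_rows).to_csv(...) a single time after the loop.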
Upvotes: 3