Reputation: 169
I am trying to scrape teacher jobs from https://www.indeed.co.in/?r=us and upload them to an Excel sheet with columns like job title, institute/school, salary, and how many days ago the job was posted. I wrote the code below for scraping, but I am getting all the text from the XPath I defined rather than the individual fields:
import selenium.webdriver
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
url = 'https://www.indeed.co.in/?r=us'
driver = webdriver.Chrome(r"mypython/bin/chromedriver_linux64/chromedriver")
driver.get(url)
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()
items = driver.find_elements_by_xpath('//*[@id="resultsCol"]')
for item in items:
    print(item.text)
Also, I am only able to scrape one page, and I want all the pages that are available after I search for "teacher". Please help me. Thanks in advance.
Upvotes: 0
Views: 251
Reputation: 68
You'll have to navigate to every page and scrape them one by one, i.e. you'll have to automate clicking the Next page button in Selenium (use the XPath of the Next page button element), then extract the data from the page source on each page. Hope I could help.
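For example, a rough sketch of that loop, assuming Indeed's Next control matches //span[@class='pn'] (the selector used in the other answer here); how you parse page_source is up to you:

# Rough pagination sketch: keep clicking "Next" until it is gone.
while True:
    html = driver.page_source  # extract your data from this on each page
    next_buttons = [el for el in driver.find_elements_by_xpath("//span[@class='pn']")
                    if 'Next' in el.text]
    if not next_buttons:
        break  # no Next control left, so this was the last page
    driver.execute_script("arguments[0].click();", next_buttons[0])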
Upvotes: 0
Reputation: 761
Try this; don't forget to import the Selenium modules:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
url = 'https://www.indeed.co.in/?r=us'
driver.get(url)
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()
# scrape data
data = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "resultsCol")))
result_set = WebDriverWait(data, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))
for result in result_set:
    title = result.find_element_by_class_name("title").text
    print(title)
    school = result.find_element_by_class_name("company").text
    print(school)
    try:
        salary = result.find_element_by_class_name("salary").text
        print(salary)
    except NoSuchElementException:
        # some result cards have no salary
        pass
    print("--------")
# move to next page
next_page = driver.find_elements_by_xpath("//span[@class='pn']")[-1]
driver.execute_script("arguments[0].click();", next_page)
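To cover every results page and get the fields into an Excel sheet as the question asks, the same scrape can be wrapped in a loop that keeps clicking Next and collects rows as it goes. Here is a sketch using openpyxl; the openpyxl usage, the teacher_jobs.xlsx filename, and the "date" class name for the "posted x days ago" field are my assumptions, not part of the answer above:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["jobtitle", "institute/school", "salary", "howmanydaysagoposted"])

while True:
    data = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "resultsCol")))
    result_set = WebDriverWait(data, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))
    for result in result_set:
        title = result.find_element_by_class_name("title").text
        school = result.find_element_by_class_name("company").text
        try:
            salary = result.find_element_by_class_name("salary").text
        except NoSuchElementException:
            salary = ""  # some cards have no salary
        try:
            # "date" is an assumed class name for the posted-date field
            posted = result.find_element_by_class_name("date").text
        except NoSuchElementException:
            posted = ""
        ws.append([title, school, salary, posted])
    # stop when there is no "Next" control left
    next_buttons = [el for el in driver.find_elements_by_xpath("//span[@class='pn']")
                    if "Next" in el.text]
    if not next_buttons:
        break
    driver.execute_script("arguments[0].click();", next_buttons[0])

wb.save("teacher_jobs.xlsx")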
Upvotes: 0
Reputation:
I'd encourage you to check out Beautiful Soup (https://pypi.org/project/beautifulsoup4/). I've used it for scraping tables:
def read_table(table):
    """Read an IP Address table.

    Args:
      table: the Soup <table> element

    Returns:
      None if the table isn't an IP Address table, otherwise a list of
      the IP Address:port values.
    """
    header = None
    rows = []
    for tr in table.find_all('tr'):
        if header is None:
            header = read_header(tr)
            if not header or header[0] != 'IP Address':
                return None
        else:
            row = read_row(tr)
            if row:
                rows.append('{}:{}'.format(row[0], row[1]))
    return rows
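For context, a hypothetical driver for that function might look like this; the URL is a placeholder, and read_header/read_row are row-parsing helpers defined in the full project linked below:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real project scrapes a proxy-listing site.
resp = requests.get('https://example.com/proxies')
soup = BeautifulSoup(resp.text, 'html.parser')
for table in soup.find_all('table'):
    addresses = read_table(table)
    if addresses:  # None means it wasn't an IP Address table
        print('\n'.join(addresses))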
That function is just a snippet from one of my Python projects: https://github.com/backslash/WebScrapers/blob/master/us-proxy-scraper/us-proxy.py. You can use Beautiful Soup to scrape tables incredibly easily. If you're worried about getting blocked, you just need to send the right headers. Another advantage of Beautiful Soup is that you don't have to wait for a browser to render pages the way you do with Selenium.
import requests

HEADERS = requests.utils.default_headers()
HEADERS.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
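You'd then pass HEADERS on every request, for example (the Indeed search URL here is my guess, not something from the original post):

# Send the custom headers with each request.
response = requests.get('https://www.indeed.co.in/jobs?q=teacher', headers=HEADERS)
print(response.status_code)  # 200 means the request was not rejected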
Best of luck
Upvotes: 1