Reputation: 169
I am trying to scrape teacher jobs from https://www.indeed.co.in/?r=us and upload them to an Excel sheet with columns like job title, institute/school, salary, and how many days ago the job was posted. I wrote the code below for scraping, but I am getting all the text from the XPath I defined rather than the individual fields:
import selenium.webdriver
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
url = 'https://www.indeed.co.in/?r=us'
driver = webdriver.Chrome(r"mypython/bin/chromedriver_linux64/chromedriver")
driver.get(url)
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()
items = driver.find_elements_by_xpath('//*[@id="resultsCol"]')
for item in items:
    print(item.text)
Also, I am only able to scrape one page, and I want all the pages that are available after I search for "teacher". Please help me. Thanks in advance.
Upvotes: 0
Views: 251
Reputation: 68
You'll have to navigate to every page and scrape them one by one, i.e. you'll have to automate clicking the Next page button in Selenium (use the XPath of the Next page button element), then extract the data from the page source on each page. Hope I could help.
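For example, a rough sketch of that loop, assuming Indeed's Next control matches //span[@class='pn'] (the selector used in the other answer here); how you parse page_source is up to you:

# Rough pagination sketch: keep clicking "Next" until it is gone.
while True:
    html = driver.page_source  # extract your data from this on each page
    next_buttons = [el for el in driver.find_elements_by_xpath("//span[@class='pn']")
                    if 'Next' in el.text]
    if not next_buttons:
        break  # no Next control left, so this was the last page
    driver.execute_script("arguments[0].click();", next_buttons[0])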
Upvotes: 0
Reputation: 761
Try this; don't forget to import the Selenium modules:
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
url = 'https://www.indeed.co.in/?r=us'
driver.get(url)
driver.find_element_by_xpath('//*[@id="text-input-what"]').send_keys("teacher")
driver.find_element_by_xpath('//*[@id="whatWhereFormId"]/div[3]/button').click()
# scrape data
data = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "resultsCol")))
result_set = WebDriverWait(data, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))
for result in result_set:
    title = result.find_element_by_class_name("title").text
    print(title)
    school = result.find_element_by_class_name("company").text
    print(school)
    try:
        salary = result.find_element_by_class_name("salary").text
        print(salary)
    except NoSuchElementException:
        # some result cards have no salary
        pass
    print("--------")
# move to next page
next_page = driver.find_elements_by_xpath("//span[@class='pn']")[-1]
driver.execute_script("arguments[0].click();", next_page)
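To cover every results page and get the fields into an Excel sheet as the question asks, the same scrape can be wrapped in a loop that keeps clicking Next and collects rows as it goes. Here is a sketch using openpyxl; the openpyxl usage, the teacher_jobs.xlsx filename, and the "date" class name for the "posted x days ago" field are my assumptions, not part of the answer above:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["jobtitle", "institute/school", "salary", "howmanydaysagoposted"])

while True:
    data = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "resultsCol")))
    result_set = WebDriverWait(data, 10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "jobsearch-SerpJobCard")))
    for result in result_set:
        title = result.find_element_by_class_name("title").text
        school = result.find_element_by_class_name("company").text
        try:
            salary = result.find_element_by_class_name("salary").text
        except NoSuchElementException:
            salary = ""  # some cards have no salary
        try:
            # "date" is an assumed class name for the posted-date field
            posted = result.find_element_by_class_name("date").text
        except NoSuchElementException:
            posted = ""
        ws.append([title, school, salary, posted])
    # stop when there is no "Next" control left
    next_buttons = [el for el in driver.find_elements_by_xpath("//span[@class='pn']")
                    if "Next" in el.text]
    if not next_buttons:
        break
    driver.execute_script("arguments[0].click();", next_buttons[0])

wb.save("teacher_jobs.xlsx")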
Upvotes: 0
Reputation:
I'd encourage you to check out Beautiful Soup (https://pypi.org/project/beautifulsoup4/). I've used it for scraping tables:
def read_table(table):
    """Read an IP Address table.

    Args:
      table: the Soup <table> element

    Returns:
      None if the table isn't an IP Address table, otherwise a list of
      the IP Address:port values.
    """
    header = None
    rows = []
    for tr in table.find_all('tr'):
        if header is None:
            header = read_header(tr)
            if not header or header[0] != 'IP Address':
                return None
        else:
            row = read_row(tr)
            if row:
                rows.append('{}:{}'.format(row[0], row[1]))
    return rows
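For context, a hypothetical driver for that function might look like this; the URL is a placeholder, and read_header/read_row are row-parsing helpers defined in the full project linked below:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; the real project scrapes a proxy-listing site.
resp = requests.get('https://example.com/proxies')
soup = BeautifulSoup(resp.text, 'html.parser')
for table in soup.find_all('table'):
    addresses = read_table(table)
    if addresses:  # None means it wasn't an IP Address table
        print('\n'.join(addresses))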
That function is just a snippet from one of my Python projects: https://github.com/backslash/WebScrapers/blob/master/us-proxy-scraper/us-proxy.py. You can use Beautiful Soup to scrape tables incredibly easily. If you're worried about getting blocked, you just need to send the right headers. Another advantage of Beautiful Soup is that you don't have to wait for a browser to render pages the way you do with Selenium.
import requests

HEADERS = requests.utils.default_headers()
HEADERS.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
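You'd then pass HEADERS on every request, for example (the Indeed search URL here is my guess, not something from the original post):

# Send the custom headers with each request.
response = requests.get('https://www.indeed.co.in/jobs?q=teacher', headers=HEADERS)
print(response.status_code)  # 200 means the request was not rejected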
Best of luck
Upvotes: 1