Giulio--

Reputation: 21

return number of results from scraping

I am trying to scrape indeed.com. I need Python to return the number of results for a search on the job "data scientist" in the city "Milan". I think it can be done either by extracting the number of results displayed on the page, or by counting the results of the search (which is what I tried to do in parts 1) and 2) below). This is the first time I have used Python, and I need to get this done for a project where this simple search is the starting point of a business project. Can you help me program it to return the number of results? Thanks indeed for your help!

##import something 
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

##tell python what I am looking for 
URL="""https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20"""
page = requests.get(URL)
soup = BeautifulSoup(page.text,"html.parser")
#print(soup.prettify())

##extract the job title (didn't work)
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

output = extract_job_title_from_result(soup)
print(output)
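A quick way to sanity-check the selector logic above without hitting the live site is to run it against a hand-written snippet. The HTML below is a made-up stand-in that matches the selectors; Indeed's real markup may use different classes and attributes, which is the usual reason this kind of extraction comes back empty:

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the selectors used above; the live
# page may use different class names and attributes.
html = """
<div class="row">
  <a data-tn-element="jobTitle" title="Data Scientist">Data Scientist</a>
</div>
<div class="row">
  <a data-tn-element="jobTitle" title="Data Analyst">Data Analyst</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
jobs = [a["title"]
        for div in soup.find_all("div", attrs={"class": "row"})
        for a in div.find_all("a", attrs={"data-tn-element": "jobTitle"})]
print(jobs)  # ['Data Scientist', 'Data Analyst']
```

If this offline check passes but the live run returns an empty list, inspect the downloaded page: the class and data-tn-element names have likely changed.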

### 1) count the results
query = "data scientist"
location = "Milano"
URL_for_count = "https://it.indeed.com/offerte-lavoro?q={}&l={}".format(query, location)
soup_for_count = BeautifulSoup(requests.get(URL_for_count).text, 'html.parser')
results_number = soup_for_count.find("div", attrs={"id": "searchCount"}).text
number_of_results = int(results_number.split(' ')[-1].replace(',', ''))


### 2) reiterate the search through the different pages of Indeed, to get ALL of the results
## number of results shown per page = 10
pages = int(number_of_results / 10)
for page_number in range(pages + 1):
    URL_for_results = "https://it.indeed.com/offerte-lavoro?q=data+scientist&l=Milano&start={}".format(10 * page_number)
    soup_for_results = BeautifulSoup(requests.get(URL_for_results).text, 'html.parser')
    results = soup_for_results.find_all('div', attrs={'data-tn-component': 'organicJob'})
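The counting step in 1) can also be tried offline. The snippet below is a sketch under the assumption that the searchCount div contains text like "Page 1 of 1,024 jobs"; the exact wording (and therefore which token holds the number) differs on the Italian site, so adjust the split index to whatever the real text looks like:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded results page;
# the real count text and markup may differ.
html = '<div id="searchCount">Page 1 of 1,024 jobs</div>'

soup_for_count = BeautifulSoup(html, "html.parser")
results_number = soup_for_count.find("div", attrs={"id": "searchCount"}).text
# Here the number is the second-to-last token; strip the thousands separator.
number_of_results = int(results_number.split()[-2].replace(",", ""))
print(number_of_results)  # 1024
```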

Upvotes: 2

Views: 950

Answers (1)

Ajax1234

Reputation: 71451

You can use the find_all method from BeautifulSoup:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

data = urlopen('https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20').read()
listing = soup(data, 'lxml')
jobs = [i.text[1:-1] for i in listing.find_all('h2')]
print(jobs)
print("number of jobs is: {}".format(len(jobs)))

Output:

['Data Scientist', 'Data Scientist', 'Junior Data Analyst', 'Oracle Data Integrator Junior', 'Junior Data Warehouse', 'Data Scientist/Biostatistician', 'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', 'Data Scientist with Machine Learning', 'DATA SCIENTIST- MACHINE LEARNING EXPERT', '7224 Internal Audit - Quantitative Analyst']

number of jobs is: 10

Edit: to get the data for the first six pages:

from urllib.request import urlopen

final_data = [[b.text[1:-1] for b in soup(urlopen("https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start={}".format(10 * i)).read(), "lxml").find_all('h2')] for i in range(6)]
print(final_data)
lengths = list(map(len, final_data))
print(lengths)
print(sum(lengths))

Output:

[['Data Scientist – Social Media Intelligence', 'DATA ANALYST', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Entry Specialist', 'Impiegato Data Entry', 'Data Scientist'], ['Junior Data Scientist', 'DATA ANALYST JR – Milano', 'STAGE JUNIOR DATA ANALYST / DATA SCIENTIST BIG DATA', 'Machine Learning Scientist', 'Data Analyst', 'Data Analyst (Econometric modeling) Sede di Milano', 'Neolaureati in statistica, matematica, ingegneria-Data Scien...', 'Data Scientist', 'Data Scientist', 'Data Scientist'], ['Data Scientist', 'Data Scientist', 'Junior Data Analyst', 'Oracle Data Integrator Junior', 'Junior Data Warehouse', 'Data Scientist/Biostatistician', 'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', 'Data Scientist with Machine Learning', 'DATA SCIENTIST- MACHINE LEARNING EXPERT', '7224 Internal Audit - Quantitative Analyst'], ['Collaboratori Data Entry', 'Data Scientist', 'DATA ENTRY', 'Consumer Data Scientist', 'DATA ANALYST', 'JUNIOR - RISK ADVISORY - TECHNOLOGY & DATA RISK - PRODUCTS &...', 'Data Manager Ematologia', 'Data Scientist', 'Esperto Tecnologie Big Data – Text Analysis – Data Mining', 'Data Entry'], ['People Data Analyst', 'Data Integration Analyst – TIBCO', 'ORACLE BI - Big Data Analytics', 'Data Strategist', 'Data Governance Specialist', 'Big Data Specialist', 'Oracle Data Integrator Specialist', 'Innovation Analyst', 'Data Scientist', 'Big Data Engineer'], ['JUNIOR BIG DATA ENGINEER', 'Junior Payment Analyst', 'Esperti BIG DATa e DWH', 'Data Warehouse Manager', 'Data Analyst', 'Big Data Engineer', 'data entry part time', 'Big Data & Datawarehouse Architect Location: Milano', 'Biomedical Signal/Image Processing Data Analyst', 'IT Big Data Engineer']]
[10, 10, 10, 10, 10, 10]
60
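Instead of hard-coding six pages (or dividing a scraped total by 10), a sturdier loop keeps requesting pages until one comes back with no job titles. A minimal offline sketch of that stop condition, where the strings stand in for the HTML of successive pages at start=0, 10, 20:

```python
from bs4 import BeautifulSoup

# Stand-in pages: in real use each string would be the body
# returned for the next value of the start parameter.
pages = [
    "<h2>Data Scientist</h2><h2>Data Analyst</h2>",
    "<h2>Big Data Engineer</h2>",
    "",  # an empty page signals the end of the results
]

total = 0
for html in pages:
    titles = [h.text for h in BeautifulSoup(html, "html.parser").find_all("h2")]
    if not titles:  # stop instead of assuming a fixed page count
        break
    total += len(titles)
print(total)  # 3
```

This way the total number of results falls out of the loop itself, with no separate searchCount parsing step.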

Upvotes: 1
