Reputation: 21
I am trying to scrape indeed.com. I need Python to return the number of results for a search for the job "data scientist" in the city "Milan". I think this can be done either by extracting the number of results displayed on the page, or by counting the results of the search myself (which is what I tried in steps 1) and 2) below). This is the first time I have used Python in my life, and I need to get this done for a project: this simple search is the starting point of a business project. Can you help me program it to return the number of results? Thanks indeed for the help!
## imports
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
## tell Python what I am looking for
URL = "https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "html.parser")
#print(soup.prettify())
## extract the job title (didn't work)
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

output = extract_job_title_from_result(soup)
print(output)
### 1) count the results
query = "data scientist"
location = "Milano"
URL_for_count = "https://it.indeed.com/offerte-lavoro?q={}&l={}".format(query, location)
soup_for_count = BeautifulSoup(requests.get(URL_for_count).text, "html.parser")
results_number = soup_for_count.find("div", attrs={"id": "searchCount"}).text
number_of_results = int(results_number.split(sep=" ")[-1].replace(",", ""))
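The split/replace parsing above assumes the "searchCount" element's text ends with the total, e.g. "Jobs 1 to 10 of 1,950" (the exact wording on it.indeed.com may differ, so check the live markup). A minimal sketch of that parsing step on a hypothetical counter string:

```python
# Hypothetical counter text; the real it.indeed.com wording may differ.
results_number = "Jobs 1 to 10 of 1,950"
# Take the last whitespace-separated token and drop the thousands separator.
number_of_results = int(results_number.split(sep=" ")[-1].replace(",", ""))
print(number_of_results)  # 1950
```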
### 2) iterate the search through the different pages of Indeed, to get ALL of the results
## number of results shown per page = 10
pages = int(number_of_results / 10)
for page_number in range(pages + 1):
    URL_for_results = "https://it.indeed.com/offerte-lavoro?q={}&l={}&start={}".format(query, location, 10 * page_number)
    soup_for_results = BeautifulSoup(requests.get(URL_for_results).text, "html.parser")
    results = soup_for_results.find_all("div", attrs={"data-tn-component": "organicJob"})
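A sketch of how page numbers map to Indeed's `start` query parameter, assuming 10 results per page and a hypothetical total of 57 results (no network access needed, so you can check the offsets before scraping):

```python
# Indeed paginates via the "start" query parameter,
# 10 results per page, so page n begins at start = 10 * n.
number_of_results = 57  # hypothetical total from step 1)
results_per_page = 10
last_page = number_of_results // results_per_page  # index of the final page
urls = [
    "https://it.indeed.com/offerte-lavoro?q=data+scientist&l=Milano&start={}".format(
        results_per_page * page
    )
    for page in range(last_page + 1)
]
print(len(urls))  # 6 URLs cover results 0-59
```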
Upvotes: 2
Views: 950
Reputation: 71451
You can use the find_all method from BeautifulSoup:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

# fetch the page and collect the text of every h2 (one per job listing)
data = urlopen('https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start=20').read()
listing = soup(data, 'lxml')
jobs = [i.text.strip() for i in listing.find_all('h2')]
print(jobs)
print("number of jobs is: {}".format(len(jobs)))
Output:
['Data Scientist', 'Data Scientist', 'Junior Data Analyst', 'Oracle Data Integrator Junior', 'Junior Data Warehouse', 'Data Scientist/Biostatistician', 'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', 'Data Scientist with Machine Learning', 'DATA SCIENTIST- MACHINE LEARNING EXPERT', '7224 Internal Audit - Quantitative Analyst']
number of jobs is: 10
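Note that counting every h2 also counts postings that merely mention "data" (e.g. data entry roles). If you only want titles matching the query, you can filter the scraped list. A sketch using a hand-copied sample of the output above; in practice, pass in the scraped `jobs` list:

```python
# Filter scraped titles for the query, case-insensitively.
jobs = ['Data Scientist', 'Junior Data Analyst',
        'Data Scientist with Machine Learning', 'Data Entry']
matches = [title for title in jobs if 'data scientist' in title.lower()]
print(len(matches))  # 2
```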
Edit: to get the data for the first six pages:
final_data = [[h2.text.strip() for h2 in soup(urlopen("https://it.indeed.com/offerte-lavoro?q=data&l=lombardia&start={}".format(10 * i)).read(), "lxml").find_all('h2')] for i in range(6)]
print(final_data)
lengths = list(map(len, final_data))
print(lengths)
print(sum(lengths))
Output:
[['Data Scientist – Social Media Intelligence', 'DATA ANALYST', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Analyst', 'Data Entry Specialist', 'Impiegato Data Entry', 'Data Scientist'], ['Junior Data Scientist', 'DATA ANALYST JR – Milano', 'STAGE JUNIOR DATA ANALYST / DATA SCIENTIST BIG DATA', 'Machine Learning Scientist', 'Data Analyst', 'Data Analyst (Econometric modeling) Sede di Milano', 'Neolaureati in statistica, matematica, ingegneria-Data Scien...', 'Data Scientist', 'Data Scientist', 'Data Scientist'], ['Data Scientist', 'Data Scientist', 'Junior Data Analyst', 'Oracle Data Integrator Junior', 'Junior Data Warehouse', 'Data Scientist/Biostatistician', 'URGENTE - RICERCA IMPIEGATO UFFICIO ORDINI / DATA ENTRY', 'Data Scientist with Machine Learning', 'DATA SCIENTIST- MACHINE LEARNING EXPERT', '7224 Internal Audit - Quantitative Analyst'], ['Collaboratori Data Entry', 'Data Scientist', 'DATA ENTRY', 'Consumer Data Scientist', 'DATA ANALYST', 'JUNIOR - RISK ADVISORY - TECHNOLOGY & DATA RISK - PRODUCTS &...', 'Data Manager Ematologia', 'Data Scientist', 'Esperto Tecnologie Big Data – Text Analysis – Data Mining', 'Data Entry'], ['People Data Analyst', 'Data Integration Analyst – TIBCO', 'ORACLE BI - Big Data Analytics', 'Data Strategist', 'Data Governance Specialist', 'Big Data Specialist', 'Oracle Data Integrator Specialist', 'Innovation Analyst', 'Data Scientist', 'Big Data Engineer'], ['JUNIOR BIG DATA ENGINEER', 'Junior Payment Analyst', 'Esperti BIG DATa e DWH', 'Data Warehouse Manager', 'Data Analyst', 'Big Data Engineer', 'data entry part time', 'Big Data & Datawarehouse Architect Location: Milano', 'Biomedical Signal/Image Processing Data Analyst', 'IT Big Data Engineer']]
[10, 10, 10, 10, 10, 10]
60
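Since final_data is a list of per-page lists, you can also flatten it into one list of titles and get a de-duplicated count (Indeed can repeat sponsored postings across pages). A sketch with a small sample in place of the scraped final_data:

```python
from itertools import chain

# Sample per-page lists; in practice this is the scraped final_data.
final_data = [["Data Scientist", "Data Analyst"],
              ["Data Analyst", "Big Data Engineer"]]
all_jobs = list(chain.from_iterable(final_data))  # flatten the pages
print(len(all_jobs))       # 4 total postings
print(len(set(all_jobs)))  # 3 unique titles
```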
Upvotes: 1