dedavilar

Reputation: 25

Create a specific Web Scraper

I am learning to scrape websites with Python, and my idea here is to build a tool that collects data from a web page. My problem is writing the "for" loop that walks through the page and collects the data from each box (item).

This is not an assignment; it is my own project. I am stuck, though, so I would appreciate any help.

Here is the code I have so far:

from bs4 import BeautifulSoup
import requests

URL_BASE = "https://www.milanuncios.com/ofertas-de-empleo-en-madrid/?dias=3&demanda=n&pagina="
MAX_PAGES = 2
counter = 0

for i in range(0, MAX_PAGES):

    #Building the URL
    if i > 0:
        url = "%s%d" % (URL_BASE, i)
    else:
        url = URL_BASE

    #We make the request to the web
    req = requests.get(url)
    
    #We check that the request returns a Status Code = 200
    statusCode = req.status_code
    if statusCode == 200:

        #We pass the HTML content of the web to a BeautifulSoup () object
        html = BeautifulSoup(req.text, "html.parser")

        #We get all the divs where the inputs are
        entradas_IDoffer = html.find_all('div', {'class': 'aditem-header'})
        
        #We go through all the inputs and extract info
        for entrada1 in entradas_IDoffer:
            
            #THESE ARE SOME ATTEMPTS
            #Title = entrada.find('div', {'class': 'aditem-detail-title'}).getText()
            #location = entrada.find('div', {'class': 'list-location-region'}).getText()
            #content = entrada.find('div', {'class': 'tx'}).getText()
            #phone = entrada.find('div', {'class': 'telefonos'}).getText()
        
            #Offer Title
            entradas_Title = html.find_all('div', {'class': 'aditem-detail'})
            for entrada2 in entradas_Title:
                counter += 1
                Title = entrada2.find('a', {'class': 'aditem-detail-title'}).getText()
                
            counter += 1
            IDoffer = entrada1.find('div', {'class': 'x5'}).getText()
                    
                    

        #Location
        #entradas_location = html.find_all('div', {'class': 'aditem-detail'})
        #for entrada4 in entradas_location:
        #    counter += 1
        #    location = entrada4.find('div', {'class': 'list-location-region'}).getText()

                    #Offer content
                    #entradas_content = html.find_all('div', {'class': 'aditem-detail'})
                    #for entrada3 in entradas_content:
                     #   counter += 1
                      #  content = entrada3.find('div', {'class': 'tx'}).getText()

            print("%d - %s  \n%s\n%s" % (counter, IDoffer.strip(),url,Title))

    else:
        try:
            r = requests.head(url)
            print(r.status_code)

        except requests.ConnectionError:
            print("failed to connect")
        break
        #If the page no longer exists and it gives me a 400

Upvotes: 2

Views: 168

Answers (1)

yf879

Reputation: 168

First, correct entradas_IDoffer so that each entry is the ad's container div and the other fields can be searched inside it:

entradas_IDoffer = html.find_all("div", class_="aditem CardTestABClass")
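One caveat about the line above: when class_ is given a string containing a space, Beautiful Soup compares it against the exact value of the class attribute, so the match depends on the order of the class names. The sketch below illustrates this on a small piece of made-up markup (SAMPLE_HTML is hypothetical and not copied from the site) and shows a CSS selector as a possible, more tolerant alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written only for this illustration; not taken from the real page.
SAMPLE_HTML = """
<div class="aditem CardTestABClass"><a class="aditem-detail-title">Offer A</a></div>
<div class="CardTestABClass aditem"><a class="aditem-detail-title">Offer B</a></div>
<div class="aditem"><a class="aditem-detail-title">Offer C</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A class_ string with a space is matched against the exact attribute value,
# so only the div whose classes appear in this order is found.
exact_matches = soup.find_all("div", class_="aditem CardTestABClass")

# A CSS selector tests for the "aditem" class on its own, in any position.
css_matches = soup.select("div.aditem")

exact_titles = [div.a.text for div in exact_matches]
css_titles = [div.a.text for div in css_matches]
print(exact_titles)
print(css_titles)
```

If the site ever reorders or renames the second class, the selector form would keep working as long as "aditem" is still present; whether that is the right class to rely on is an assumption to check against the live page.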

The title is in an "a" tag, not a "div":

title = entrada.find("a", class_="aditem-detail-title").text.strip()
location = entrada.find("div", class_="list-location-region").text.strip()
content = entrada.find("div", class_="tx").text.strip()

Do the same for the other fields.

The phone number may be loaded with JavaScript, in which case you will not be able to get it with bs4; you can get it with Selenium instead.

Your code for looping through multiple pages is longer than it needs to be. To go through pages 1 and 2, iterate over a range and put the page number into the URL with a formatted string:

for page in range(1, 3):
    url = f'https://www.milanuncios.com/ofertas-de-empleo-en-madrid/?dias=3&demanda=n&pagina={page}'
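As a variation on the loop above, the query string can also be assembled with the standard library instead of being written by hand. This is only a sketch: build_page_url and its days parameter are hypothetical names introduced here, and the parameter names (dias, demanda, pagina) are simply the ones visible in the URL from the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.milanuncios.com/ofertas-de-empleo-en-madrid/"


def build_page_url(page, days=3):
    """Return the listing URL for one page (hypothetical helper)."""
    # Query parameter names are copied from the URL in the question.
    query = urlencode({"dias": days, "demanda": "n", "pagina": page})
    return f"{BASE_URL}?{query}"


# Pages 1 and 2, matching range(1, 3) in the loop above.
page_urls = [build_page_url(page) for page in range(1, 3)]
for url in page_urls:
    print(url)
```

Keeping the parameters in a dict makes it easier to change one of them later, for example the number of days, without editing the string.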

Full code:

import requests
from bs4 import BeautifulSoup

for page in range(1, 5):
    # Build the URL for the current page and download it
    url = f'https://www.milanuncios.com/ofertas-de-empleo-en-madrid/?dias=3&demanda=n&pagina={page}'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Each ad is a div with this class attribute
    entradas_IDoffer = soup.find_all("div", class_="aditem CardTestABClass")

    for entrada in entradas_IDoffer:
        # Search inside the current ad only, not the whole page
        title = entrada.find("a", class_="aditem-detail-title").text.strip()
        ID = entrada.find("div", class_="x5").text.strip()
        location = entrada.find("div", class_="list-location-region").text.strip()
        content = entrada.find("div", class_="tx").text.strip()

        print(title, ID, location, content)
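One thing to watch in the code above: find() returns None when an ad lacks one of the elements, and calling .text on None raises AttributeError. A possible way to guard against that is sketched below. text_or_none and parse_offers are hypothetical helpers, and SAMPLE_HTML is made-up markup that only reuses the class names from this answer; in real use you would pass in response.text.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, tag, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(tag, class_=css_class)
    return node.text.strip() if node is not None else None


def parse_offers(html):
    """Extract one dict per ad from a listing page's HTML (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    offers = []
    for entrada in soup.find_all("div", class_="aditem CardTestABClass"):
        offers.append({
            "title": text_or_none(entrada, "a", "aditem-detail-title"),
            "id": text_or_none(entrada, "div", "x5"),
            "location": text_or_none(entrada, "div", "list-location-region"),
            "content": text_or_none(entrada, "div", "tx"),
        })
    return offers


# Hypothetical markup for illustration; the second ad is missing most fields.
SAMPLE_HTML = """
<div class="aditem CardTestABClass">
  <div class="x5">r123</div>
  <a class="aditem-detail-title"> Waiter </a>
  <div class="list-location-region">Madrid</div>
  <div class="tx">Full time</div>
</div>
<div class="aditem CardTestABClass">
  <a class="aditem-detail-title">Driver</a>
</div>
"""

offers = parse_offers(SAMPLE_HTML)
for offer in offers:
    print(offer)
```

Separating the parsing from the request also means the extraction logic can be tried on a saved copy of the page without hitting the site each time.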

Upvotes: 1
