SimpleProgrammer

Reputation: 259

BeautifulSoup (Python): how to grab the text string next to a tag (that may or may not exist)?

I think my title explains the problem I am facing pretty well. Let's look at a picture of it. (You can find the web page at this address, though it has probably changed.)

[Image: a listing for heavy-duty machinery for sale]

I have highlighted the text that I want to grab in blue: the model year, 2008. The seller is not required to submit the model year, so it may or may not exist. When it does exist, it always follows the <i> tag with class="fa fa-calendar". My solution so far has been to grab all the text within <p class="result-details"> ... </p>, split it into a list, and choose the second element, provided that <i class="fa fa-calendar"> ... </i> exists. Otherwise I do not grab anything.

However, this does not work in general, since the text that comes before the year is split into more than one element if it contains whitespace. So, is there any way (any function) to grab a text string that neighbours another tag, as seen in my picture?

PS: In case I was unclear, I just want to fetch the year 2008 from the post on the web page, if it exists.

Edit

In this situation my code erroneously gives me the word "Hjulvältar" (bulldozer in English) instead of the year 2008.

CODE

from bs4 import BeautifulSoup
from datetime import date
import requests

url_avvikande = ['bomliftar','teleskop-bomliftar','kompakta-sjalvgaende-bomlyftar','bandschaktare','reachstackers','staplare']
today = date.today().isoformat()

url_main = 'https://www.mascus.se'
produktgrupper = ['lantbruksmaskiner','transportfordon','skogsmaskiner','entreprenadmaskiner','materialhantering','gronytemaskiner']
kategorier = {
  'lantbruksmaskiner': ['traktorer','sjalvgaende-falthackar','skordetroskor','atv','utv:er','snoskotrar'],
  'transportfordon': ['fordonstruckar','elektriska-fordon','terrangfordon'],
  'skogsmaskiner': ['skog-skordare','skog-gravmaskiner','skotare','drivare','fallare-laggare','skogstraktorer','lunnare','terminal-lastare'],
  'entreprenadmaskiner': ['gravlastare','bandgravare','minigravare-7t','hjulgravare','midigravmaskiner-7t-12t','atervinningshanterare','amfibiska-gravmaskiner','gravmaskiner-med-frontskopa','gravmaskiner-med-lang-rackvidd','gravmaskiner-med-slapskopa','rivningsgravare','specialgravmaskiner','hjullastare','kompaktlastare','minilastmaskiner','bandlastare','teleskopiska-hjullastare','redaskapshallare','gruvlastare','truckar-och-lastare-for-gruvor','bergborriggar','teleskoplastare','dumprar','minidumprar','gruvtruckar','banddumprar','specialiserade-dragare','vaghyvlar','vattentankbilar','allterrangkranar','terrangkranar-grov-terrang','-bandgaende-kranar','saxliftar','bomliftar','teleskop-bomliftar','personhissar-och-andra-hissar','kompakta-sjalvgaende-bomlyftar','krossar','mobila-krossar','sorteringsverk','mobila-sorteringsverk','bandschaktare','asfaltslaggningsmaskiner','--asfaltskallfrasmaskiner','tvavalsvaltar','envalsvaltar','jordkompaktorer','pneumatiska-hjulvaltar','andra-valtar','kombirullar','borrutrustning-ytborrning','horisontella-borrutrustning','trenchers-skar-gravmaskin'],
  'materialhantering': ['dieseltruckar','eldrivna-gaffeltruckar','lpg-truckar','gaffeltruckar---ovriga','skjutstativtruck','sidlastare','teleskopbomtruckar','terminaltraktorer','reachstackers','ovriga-materialhantering-maskiner','staplare-led','staplare','plocktruck-laglyftande','plocktruck-hoglyftande','plocktruck-mediumlyftande','dragtruck','terrangtruck','4-vagstruck','smalgangstruck','skurborsttorkar','inomhus-sopmaskiner','kombinationsskurborstar'],
  'gronytemaskiner': ['kompakttraktorer','akgrasklippare','robotgrasklippare','nollsvangare','plattformsklippare','sopmaskiner','verktygsfraktare','redskapsbarare','golfbilar','fairway-grasklippare','green-grasklippare','grasmattevaltar','ovriga-gronytemaskiner']
  }

url = 'https://www.mascus.se'
mappar = ['Lantbruk', 'Transportfordon', 'Skogsmaskiner', 'Entreprenad',  'Materialhantering', 'Grönytemaskiner']
index = -1
status = True
for produktgrupp in kategorier:
  index += 1
  mapp = mappar[index]
  save_path = f'/home/protector.local/vika99/webscrape_mascus/Annonser/{mapp}'
  underkategorier = kategorier[produktgrupp]
  for underkategori in underkategorier:
        # OBS
        if underkategori != 'borrutrustning-ytborrning' and status:
              continue
        else:
              status = False
        # OBS
        if underkategori in url_avvikande:
              url = f'{url_main}/{produktgrupp}/{underkategori}'
        elif underkategori == 'gravmaskiner-med-frontskopa':
              url = f'{url_main}/{produktgrupp}/begagnat-{underkategori}'
        elif underkategori == 'borrutrustning-ytborrning':
              url = f'{url_main}/{produktgrupp}/begagnad-{underkategori}'
        else:
              url = f'{url_main}/{produktgrupp}/begagnade-{underkategori}'
        file_name = f'{save_path}/{produktgrupp}_{underkategori}_{today}.txt'
        sida = 1
        print(url)
        with open(file_name, 'w') as f:
              while True:
                    print(sida)
                    html_text = None
                    soup = None
                    links = None
                    while links is None:
                          html_text = requests.get(url).text
                          soup = BeautifulSoup(html_text, 'lxml')
                          links = soup.find('ul', class_ = 'page-numbers')
                    annonser = soup.find_all('li', class_ = 'col-row single-result')
                    for annons in annonser:
                          modell = annons.find('a', class_ = 'title-font').text
                          if annons.p.find('i', class_ = 'fa fa-calendar') is not None:
                                tillverkningsar = annons.find('p', class_ = 'result-details').text.strip().split(" ")[1]
                          else:
                                tillverkningsar = 'Ej angiven'
                          try:
                                pris = annons.find('span', class_ = 'title-font no-ws-wrap').text
                          except AttributeError:
                                pris = annons.find('span', class_ = 'title-font no-price').text
                          f.write(f'{produktgrupp:<21}{underkategori:25}{modell:<70}{tillverkningsar:<13}{pris:>14}\n')
                    url_part = None
                    sida += 1
                    try:
                          url_part = links.find('a', string = f'{sida}')['href']
                    except TypeError:
                          print('Avläsning av underkategori klar.')
                          break
                    url = f'{url_main}{url_part}'

Upvotes: 0

Views: 503

Answers (1)

QHarr

Reputation: 84465

As you loop over the listings, you can test whether the calendar icon is present. If it is, grab its next_sibling, which is the text node that directly follows the icon (the year).

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.mascus.se/entreprenadmaskiner/begagnade-pneumatiska-hjulvaltar')
soup = bs(r.content, 'lxml')
# One element per listing on the results page
listings = soup.select('.single-result')

for listing in listings:
    # The calendar icon is only present when the seller gave a model year
    calendar = listing.select_one('.fa-calendar')

    if calendar is not None:
        # The year is the text node right after the icon
        print(calendar.next_sibling)
    else:
        print('Not present')
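
To try the idea without hitting the live site, here is a minimal offline sketch. The HTML snippet, the model names and the helper extract_year are all made up for illustration; the markup is only assumed to resemble the structure described in the question. It also shows why splitting the paragraph text on spaces picks the wrong word when the text before the icon contains a space.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup, assumed to resemble the listings in the question.
SAMPLE_HTML = """
<ul>
  <li class="col-row single-result">
    <a class="title-font">Example Model A</a>
    <p class="result-details">Pneumatiska hjulvältar <i class="fa fa-calendar"></i> 2008 <i class="fa fa-map-marker"></i> Sverige</p>
  </li>
  <li class="col-row single-result">
    <a class="title-font">Example Model B</a>
    <p class="result-details">Hjulvältar <i class="fa fa-map-marker"></i> Sverige</p>
  </li>
</ul>
"""


def extract_year(listing, default="Ej angiven"):
    """Return the text right after the calendar icon, or `default`.

    Hypothetical helper, not part of BeautifulSoup.
    """
    calendar = listing.select_one("i.fa-calendar")
    if calendar is None:
        return default
    sibling = calendar.next_sibling
    # next_sibling can be None or another tag; only accept a text node.
    if not isinstance(sibling, NavigableString):
        return default
    # Strip the surrounding whitespace; fall back if nothing is left.
    return sibling.strip() or default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = soup.select("li.single-result")

# The approach from the question: second space-separated word of the text.
naive = listings[0].find("p", class_="result-details").text.strip().split(" ")[1]

years = [extract_year(listing) for listing in listings]
print(naive)
print(years)
```

Because next_sibling returns the raw text node, it usually carries leading and trailing whitespace, so calling strip() on it before writing it to a file is probably what you want.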

Upvotes: 2
