R Web Scraping with RVEST: DT and DD Tag

Question

I am scraping data from a car website (Example link). So far everything has worked fine, but I am having trouble getting one specific value. The listing page contains a list of details, and I want to get the value of "Anzahl Türen" (the number of doors of the car).

In HTML, the "Anzahl Türen" entry looks like this:

<dt>Anzahl Türen</dt>
<dd>3</dd>

How do I get the value "3" here?

For other values in this list, I have worked around the "dt" and "dd" tags with a function like the following, which gets the value of "Außenfarbe" (color):

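library(rvest)

# Returns the exterior color of the car on the given page
get_color = function(links) {
     car_page = read_html(links)
     # Text of every link nested inside a <dd> element
     col = car_page %>% html_nodes("dd a") %>%
        html_text()
     colors = c("Blau","Rot","Schwarz","Silber", "Beige", "Braun","Bronze","Gelb","Grau", 
                "Grün","Violett","Weiß","Orange","Gold")
     # Keep only the entries that are known color names
     intersect(col, colors)
}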

But the value I want now is a number rather than text, so I cannot match it against a list of known values, and this approach does not work for this row.

Thanks for your help! :)

Ronak Shah · Accepted Answer

You can extract all the values into a data frame and then select the ones you want.

library(rvest)
url <- 'https://www.autoscout24.de/angebote/volkswagen-lupo-1-2-tdi-3l-diesel-gruen-b5ee4316-b672-490a-917d-6732bb6065a6?&cldtidx=20&cldtsrc=listPage&searchId=603681428'
car_page = read_html(url)

data <- tibble::tibble(name = car_page %>% html_nodes("dt") %>% html_text(), 
               value = car_page %>% html_nodes("dd") %>% html_text() %>% trimws)

# A tibble: 22 x 2
#   name                            value      
#   <chr>                           <chr>      
# 1 Zustand                         Gebraucht  
# 2 HU Prüfung                      04/2023    
# 3 Letzter Kundendienst            03/2021    
# 4 Letzter Wechsel des Zahnriemens 05/2015    
# 5 Marke                           Volkswagen 
# 6 Modell                          Lupo       
# 7 Erstzulassung                   2000       
# 8 Außenfarbe                      Grün       
# 9 Innenausstattung                Stoff, Grün
#10 Karosserieform                  Kleinwagen 
# … with 12 more rows

data$value[data$name == "Anzahl Türen"]
#[1] "3"
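
The same lookup can be wrapped in a small helper and converted to a number. This is only a minimal sketch, not part of the original answer: `get_detail` is a hypothetical name, and the inline HTML is an assumed stand-in for the real page so that the example runs offline. It also assumes that every "dt" has exactly one matching "dd", as in the tibble above.

```r
library(rvest)

# Inline stand-in for the car page (assumed structure: one <dd> per <dt>)
html <- '<dl>
  <dt>Außenfarbe</dt><dd><a href="#">Grün</a></dd>
  <dt>Anzahl Türen</dt><dd> 3 </dd>
</dl>'
page <- read_html(html)

# Hypothetical helper: return the <dd> text whose <dt> label equals `label`
get_detail <- function(page, label) {
  name  <- page %>% html_nodes("dt") %>% html_text() %>% trimws()
  value <- page %>% html_nodes("dd") %>% html_text() %>% trimws()
  value[name == label]
}

# html_text() always returns character, so convert explicitly for numeric fields
doors <- as.integer(get_detail(page, "Anzahl Türen"))
color <- get_detail(page, "Außenfarbe")
```

Because the helper matches on the label rather than on a list of known values, it works for text and numeric rows alike.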

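Testing on another URL (the page and `data` have to be rebuilt from the new URL before the lookup):

url <- 'https://www.autoscout24.de/angebote/peugeot-206-110-tendance-tuev-neu-1-6-109ps-4-5tuetig-benzin-blau-cd721b01-494f-47cc-aead-c3c41d728935?ipc=recommendation&ipl=detailpage-engine-itemBased'
car_page = read_html(url)
# rebuild `data` from the new car_page exactly as above, then:

data$value[data$name == "Anzahl Türen"]
#[1] "5"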
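
Pairing all "dt" nodes with all "dd" nodes by position only works when both counts match. If a page ever has an extra "dd" (or a "dt" without a value), one alternative is to ask for the "dd" that directly follows the matching "dt". The following is a sketch under assumptions: `get_detail_xpath` is a hypothetical name, the inline HTML is a made-up stand-in for the real page, and the label must not contain a single quote because it is pasted into the XPath expression.

```r
library(rvest)

# Inline stand-in for the car page (assumed structure)
html <- '<dl>
  <dt>Marke</dt><dd>Volkswagen</dd>
  <dt>Anzahl Türen</dt><dd>3</dd>
</dl>'
page <- read_html(html)

# Hypothetical helper: find the <dt> with the given label and
# return the text of the first <dd> sibling that follows it
get_detail_xpath <- function(page, label) {
  xpath <- sprintf("//dt[normalize-space()='%s']/following-sibling::dd[1]", label)
  page %>% html_nodes(xpath = xpath) %>% html_text() %>% trimws()
}

doors_chr <- get_detail_xpath(page, "Anzahl Türen")
brand     <- get_detail_xpath(page, "Marke")
```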
