Henry Navarro

Reputation: 953

How to scrape values from HTML select/option tags in R

I have a problem similar to the one in this question: link.

I am scraping this web page and would like to download the option values and their text, that is, the entries listed under "Seleccionar municipio". This is the corresponding node in the HTML code:

<select name="txtMunicipio" id="txtMunicipio" class="inputText"><option value="">-------------------------------------</option>
  <option value="001">ALEGRIA-DULANTZI</option>
  <option value="002">AMURRIO</option>
  <option value="049">AÑANA</option>
  <option value="003">ARAMAIO</option>
  <option value="006">ARMIÑON</option>
  <option value="037">ARRAIA-MAEZTU</option>
  ...
  </select>

I would like to obtain something like the following:

ID   Name
001  ALEGRIA-DULANTZI
002  AMURRIO
049  AÑANA
...

I tried something similar to the question I referenced above, using the following lines in R:

library(rvest)

web_page <- read_html("https://catastroalava.tracasa.es/descargas/?lang=es")

web_page %>% html_nodes("select#txtMunicipio.inputText option") %>% html_attr("value")
web_page %>% html_nodes(".inputText option") %>% html_attr("value")
web_page %>% html_nodes("#txtMunicipio option") %>% html_attr("value")

web_page %>% html_nodes("select#txtMunicipio.inputText option") %>% html_text()
web_page %>% html_nodes(".inputText option") %>% html_text()
web_page %>% html_nodes("#txtMunicipio option") %>% html_text()

But all I ever get is:

character(0)

Could you please help me work out which selector I need to pass to html_nodes to download this information?

Thanks in advance

Upvotes: 1

Views: 615

Answers (1)

hrbrmstr

Reputation: 78832

Your incorrect assertion regarding question duplication notwithstanding, the data is on the page. I suspect you used Selector Gadget or some such tool to identify the rendered page nodes and never viewed the original source of the web site (this is a super common issue and one reason I have much disdain for Selector Gadget and the prevalence of advice that says to use it first before any other investigation).

The popup is built dynamically after page load and here's the source:

(screenshot of the page source)

I've personally authored more than a few SO answers that show how to get such data, but we'll ignore the existence of those for this.

The general idea is to grab just enough JavaScript (it has to work with the V8 package, which is based on a rly rly old V8 engine version) to let V8 parse the data, and then marshal the values back to R.

library(rvest)
library(V8)
library(purrr)

ctx <- v8() # JavaScript context: evaluates the page's script so its data can be pulled back into R

pg <- read_html("https://catastroalava.tracasa.es/descargas/?lang=es")

html_nodes(pg, xpath=".//script[contains(., 'ALEGRIA-DULANTZI')]") %>% 
  html_text() %>% 
  gsub("function escribeMunicipios.*$", "", .) %>%  # get rid of everything but the data
  ctx$eval(.)

ctx$get("municipios") %>% 
  setNames(c("ID", "Name"))
##     ID                 Name
## 1  001     ALEGRIA-DULANTZI
## 2  002              AMURRIO
## 3  049                AÑANA
## 4  003              ARAMAIO
## 5  006              ARMIÑON
## 6  037        ARRAIA-MAEZTU
## 7  008   ARRAZUA-UBARRUNDIA
## 8  004           ARTZINIEGA
## 9  009            ASPARRENA
## 10 010                AYALA
## ... goes on ...
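
To try the same idea without hitting the live site, here is a minimal sketch that runs against an inline HTML string. The HTML below is a made-up stand-in, not the real page source: the layout of the `municipios` array and its field names (`cod`, `nom`) are assumptions for illustration only. It shows why the `option` selectors return nothing on the static source, and how V8 recovers the data from the script.

```r
library(rvest)
library(V8)

# Hypothetical stand-in for the page: the <select> is empty in the static
# source and a script holds the data. Field names cod/nom are assumed.
html <- '<html><body>
<select name="txtMunicipio" id="txtMunicipio" class="inputText"></select>
<script>
var municipios = [
  {"cod": "001", "nom": "ALEGRIA-DULANTZI"},
  {"cod": "002", "nom": "AMURRIO"}
];
function escribeMunicipios() { /* would fill the select in a browser */ }
</script>
</body></html>'

pg <- read_html(html)

# No <option> nodes exist before any JavaScript runs, so CSS selectors
# aimed at them match nothing.
n_static_options <- length(html_nodes(pg, "#txtMunicipio option"))

# Pull the text of the script that carries the data.
js <- html_text(html_nodes(pg, xpath = ".//script[contains(., 'ALEGRIA-DULANTZI')]"))

# Keep only the data declaration; drop the DOM-manipulating function,
# which would fail outside a browser.
js <- sub("function escribeMunicipios.*$", "", js)

ctx <- v8()
ctx$eval(js)

# ctx$get() converts the JavaScript array of objects to a data frame.
municipios <- setNames(ctx$get("municipios"), c("ID", "Name"))
print(municipios)
```

On the real page the script contents may be shaped differently, so inspect the raw source (View Source, not the rendered DOM) before settling on the `sub()` pattern.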

Upvotes: 3
