Reputation: 953
I have a similar problem to the one shown in this question: link.
I am scraping this web page and I would like to download the text inside the option values, i.e. the dropdown where you can read "Seleccionar municipio", which corresponds to the following node in the HTML code:
<select name="txtMunicipio" id="txtMunicipio" class="inputText"><option value="">-------------------------------------</option>
<option value="001">ALEGRIA-DULANTZI</option>
<option value="002">AMURRIO</option>
<option value="049">AÑANA</option>
<option value="003">ARAMAIO</option>
<option value="006">ARMIÑON</option>
<option value="037">ARRAIA-MAEZTU</option>
...
</select>
And I would like to obtain something like the following:
ID Name
001 ALEGRIA-DULANTZI
002 AMURRIO
049 AÑANA
...
So I have tried something similar to the question I referenced above, using the following lines in R:
library(rvest)

web_page <- read_html("https://catastroalava.tracasa.es/descargas/?lang=es")
web_page %>% html_nodes("select#txtMunicipio.inputText option") %>% html_attr("value")
web_page %>% html_nodes(".inputText option") %>% html_attr("value")
web_page %>% html_nodes("#txtMunicipio option") %>% html_attr("value")
web_page %>% html_nodes("select#txtMunicipio.inputText option") %>% html_text()
web_page %>% html_nodes(".inputText option") %>% html_text()
web_page %>% html_nodes("#txtMunicipio option") %>% html_text()
But what I obtain is always:
character(0)
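For reference, once the selectors return something, I was planning to put the values and the texts together roughly like this (just my sketch of the final step, using data.frame):

ids  <- web_page %>% html_nodes("#txtMunicipio option") %>% html_attr("value")
txts <- web_page %>% html_nodes("#txtMunicipio option") %>% html_text()
data.frame(ID = ids, Name = txts, stringsAsFactors = FALSE)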
Please, could you help me figure out which parameters I have to pass to the html_nodes function to download the information?
Thanks in advance
Upvotes: 1
Views: 615
Reputation: 78832
Your incorrect assertion regarding question duplication notwithstanding, the data is on the page. I suspect you used Selector Gadget or some such tool to identify the rendered page nodes and never viewed the original source of the web site (this is a super common issue and one reason I have much disdain for Selector Gadget and the prevalence of advice that says to use it first before any other investigation).
The dropdown is built dynamically after page load: the raw source contains a <script> block that defines a municipios array holding the data, alongside an escribeMunicipios() function that (as its name suggests) writes the <option> elements.
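If you want to verify that yourself, here's a quick sketch (the XPath just searches for one of the known municipality names; nothing fancier is assumed):

library(rvest)

pg <- read_html("https://catastroalava.tracasa.es/descargas/?lang=es")

# the <select> has no options in the raw HTML, so this finds nothing
html_nodes(pg, "#txtMunicipio option")

# ...but the municipality names are present inside a <script> block
html_nodes(pg, xpath = ".//script[contains(., 'ALEGRIA-DULANTZI')]")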
I've personally authored more than a few SO answers that show how to get such data, but we'll ignore the existence of those for this.
The general idea is to extract just enough javascript (code that will work with the V8 package, since it's based on a really, really old V8 engine version), let it parse the data, and then marshal the values back to R.
library(rvest)
library(V8)
library(purrr)

ctx <- v8() # a JavaScript context we can evaluate the page's script in

pg <- read_html("https://catastroalava.tracasa.es/descargas/?lang=es")

html_nodes(pg, xpath = ".//script[contains(., 'ALEGRIA-DULANTZI')]") %>%
  html_text() %>%
  gsub("function escribeMunicipios.*$", "", .) %>% # get rid of everything but the data
  ctx$eval(.)

ctx$get("municipios") %>%
  setNames(c("ID", "Name"))
## ID Name
## 1 001 ALEGRIA-DULANTZI
## 2 002 AMURRIO
## 3 049 AÑANA
## 4 003 ARAMAIO
## 5 006 ARMIÑON
## 6 037 ARRAIA-MAEZTU
## 7 008 ARRAZUA-UBARRUNDIA
## 8 004 ARTZINIEGA
## 9 009 ASPARRENA
## 10 010 AYALA
## ... goes on ...
Upvotes: 3