Reputation: 11
I am trying to scrape the page http://www.funda.nl/en/koop/leiden/ to get the highest page number it shows, which is 29. Following an online tutorial, I located where 29 appears in the HTML and wrote this R code:
url<- read_html("http://www.funda.nl/en/koop/leiden/")
url %>% html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page") %>% as.numeric()
However, what I get is numeric(0). If I remove as.numeric(), I get character(0).
How can I extract the last page number?
Upvotes: 0
Views: 599
Reputation: 1
I've been dealing with the same issue and this worked for me:
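> library(rvest)    # read_html(), html_nodes(), html_children(), html_text()
> library(stringr)  # str_extract()
> library(dplyr)    # last()
> url = "http://www.funda.nl/en/koop/leiden/"
> last_page <-
+ last(read_html(url) %>%
+ html_nodes(css = ".pagination-pages") %>%
+ html_children()) %>%   # keep only the last child of the pagination block
+ html_text(trim = TRUE) %>%
+ str_extract("[0-9]+") %>%   # first run of digits in its text
+ as.numeric()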
> last_page
[1] 23
Upvotes: 0
Reputation: 322
I believe that both your identification of the HTML element and your parsing of it are wrong. To easily find a CSS selector, you can use a Chrome extension called SelectorGadget. In your case, the result also needs some parsing, which is done by the str_extract_all() function.
This will work:
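library(rvest)  # read_html(), html_node(), html_text(), %>%
url <- read_html("http://www.funda.nl/en/koop/leiden/")
pagination.last <- url %>%
  html_node(".pagination-last") %>%
  html_text() %>%
  # pull runs of one or two digits out of the node's text
  stringr::str_extract_all("[:number:]{1,2}", simplify = TRUE) %>%
  as.numeric()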
> pagination.last
[1] 29
You might find this other question helpful as well: R: Rvest - got hidden text i don't want
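To illustrate why the selector in the question can come back empty, here is a minimal sketch on made-up markup. The HTML below is an assumption for illustration only, not the real funda.nl page. In CSS, "#pagination-number.pagination-last" matches a single element that has both that id and that class, so if no such element exists you get character(0). Separately, html_attr() returns NA when the attribute sits on a different element than the one you selected.

```r
library(rvest)

# Hypothetical markup (NOT copied from funda.nl): the class is on a wrapper
# element and the data attribute is on a child link.
page <- read_html('
  <html><body>
    <div class="pagination-pages">
      <a data-pagination-page="1">Page 1</a>
      <span class="pagination-last">
        <a data-pagination-page="29">Page 29</a>
      </span>
    </div>
  </body></html>')

# No element has id "pagination-number" AND class "pagination-last",
# so nothing matches and html_attr() has nothing to read.
no_match <- page %>%
  html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page")

# The class matches the <span>, but the attribute is not on that element.
attr_on_parent <- page %>%
  html_node(".pagination-last") %>%
  html_attr("data-pagination-page")

# Selecting the child link inside the wrapper finds the attribute.
last_page <- page %>%
  html_node(".pagination-last a") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()

print(no_match)        # character(0)
print(attr_on_parent)  # NA
print(last_page)       # 29
```

If the real page differs from this assumed structure, the same checks still help: inspect the length of the node set first, then check whether the attribute or the text is on the node you actually matched.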
Upvotes: 0