Lazar

Reputation: 11

How do I use rvest for web crawling correctly?

I am trying to crawl this page, http://www.funda.nl/en/koop/leiden/, to get the highest page number it can show, which is 29. Following some online tutorials, I located where 29 appears in the HTML and wrote this R code:

library(rvest)

url <- read_html("http://www.funda.nl/en/koop/leiden/")

url %>% html_nodes("#pagination-number.pagination-last") %>% html_attr("data-pagination-page") %>% as.numeric()

However, the result is numeric(0). If I remove as.numeric(), I get character(0).

What is the correct way to do this?

Upvotes: 0

Views: 599

Answers (2)

natorokado

Reputation: 1

I've been dealing with the same issue and this worked for me:

> library(rvest)
> library(dplyr)    # assumed source of last()
> library(stringr)  # str_extract()
> url <- "http://www.funda.nl/en/koop/leiden/"
> last_page <-
+   last(read_html(url) %>% 
+          html_nodes(css = ".pagination-pages") %>%
+          html_children()) %>% 
+   html_text(trim = TRUE) %>% 
+   str_extract("[0-9]+") %>% 
+   as.numeric()
> last_page
[1] 23

Upvotes: 0

ZLevine

Reputation: 322

I believe both your CSS selector and your parsing of the HTML are wrong. To find the right CSS selector easily, you can use a Chrome extension called SelectorGadget. In your case the node's text also needs some parsing, which the str_extract_all() call handles.

This will work:

library(rvest)

url <- read_html("http://www.funda.nl/en/koop/leiden/")

pagination.last <- url %>% 
  html_node(".pagination-last") %>%
  html_text() %>% 
  stringr::str_extract_all("[:number:]{1,2}", simplify = TRUE) %>%
  as.numeric()

> pagination.last
[1] 29
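
To illustrate why the selector in the question comes back empty, here is a minimal sketch on made-up markup. The HTML string is an assumption for illustration, not the actual funda.nl source; only the class name pagination-last and the attribute name data-pagination-page are taken from the code in this thread.

```r
library(rvest)

# Hypothetical markup, NOT copied from funda.nl: the last pagination link
# carries class "pagination-last" and a "data-pagination-page" attribute.
html <- read_html('<div class="pagination-pages">
  <a data-pagination-page="1">1</a>
  <a data-pagination-page="2">2</a>
  <a class="pagination-last" data-pagination-page="29">Page 29</a>
</div>')

# "#pagination-number.pagination-last" matches only an element that has
# id="pagination-number" AND class="pagination-last" at the same time.
# Nothing in this snippet has that id, so the node set is empty.
no_match <- html %>%
  html_nodes("#pagination-number.pagination-last") %>%
  html_attr("data-pagination-page")

# Selecting by class alone finds the element, and the attribute can be
# read directly without any text parsing.
last_page <- html %>%
  html_node(".pagination-last") %>%
  html_attr("data-pagination-page") %>%
  as.numeric()
```

In this sketch no_match is an empty character vector, which is the same symptom as in the question, while last_page holds the number stored in the attribute. Whether the real page exposes that attribute on the same element is something you would need to check in the page source.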

You might find this other question helpful as well: R: Rvest - got hidden text i don't want

Upvotes: 0
