Worice

Reputation: 4037

rvest returns string rather than list

According to the documentation, html_nodes() from rvest should behave as follows (quote): "When applied to a list of nodes, html_nodes() returns all nodes, collapsing results into a new nodelist."

In my case, however, it returns a single string in which every node is collapsed. Why does this happen? Debugging did not change anything: it always returns the same string, with the page numbers run together:

123456789101112131415...4950

library(tidyverse)  
library(rvest)    
library(stringr)   
library(rebus)     
library(lubridate)

url <- 'https://footballdatabase.com/ranking/world/1'
html <- read_html(url)

get_last_page <- function(html){
  pages_data <- html %>% 
    # The '.' indicates the class
    html_nodes('.pagination') %>% 
    # Extract the raw text as a list
    html_text()                   
  # The second to last of the buttons is the one
  pages_data[(length(pages_data)-1)] %>%            

    unname() %>%                                     
    # Convert to number
    as.numeric()                                     
}

I also tried wrapping the output in list(), without success. Using html_node() instead did not solve the problem either.

Upvotes: 0

Views: 139

Answers (1)

lroha

Reputation: 34441

There is only a single node extracted with the selector '.pagination', so when html_text() is applied, all the text in that node is returned collapsed together. Change the CSS selector to include the anchors, then extract the text, so that a character vector with one element per node is returned.

html %>%
  html_nodes('.pagination a') %>%
  html_text()

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32"
[33] "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50"
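
As a rough sketch, the same selector could be folded back into get_last_page(). The inline HTML here is a made-up stand-in for the real page so the example runs offline, and taking the largest numeric label (rather than the second-to-last button) is an assumption about what "last page" should mean. It guards against a non-numeric link such as "Next", which the real page may not have.

```r
library(rvest)

# Hypothetical stand-in for the real page: one .pagination container, four links
html <- read_html(
  '<div class="pagination"><a>1</a><a>2</a><a>3</a><a>Next</a></div>'
)

get_last_page <- function(html) {
  labels <- html %>%
    # Select the anchors inside the container, not the container itself
    html_nodes(".pagination a") %>%
    # One string per anchor, returned as a character vector
    html_text()
  # Non-numeric labels (e.g. "Next") become NA and are dropped by na.rm
  page_numbers <- suppressWarnings(as.numeric(labels))
  max(page_numbers, na.rm = TRUE)
}

# The container alone matches a single node, which is why its text collapses
n_containers <- length(html_nodes(html, ".pagination"))
last_page <- get_last_page(html)
```

With the stand-in HTML, n_containers is 1 and last_page is 3. Against the real URL the result depends on the page's markup at the time it is fetched.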

Upvotes: 1
