Valentin Figueroa

Reputation: 3

html_nodes returns an empty list

I am scraping the number of newspaper articles containing certain words. For example, the word "Republican" in CA in 1929, from this website:

url <- "https://www.newspapers.com/search/#query=republican&dr_year=1929-1929&p_place=CA"

I want to copy the number of hits (in the example, 23490), and I am using this code:

  library(rvest)

  hits <- url %>%
    read_html() %>%
    html_nodes('.total-hits') %>%
    html_text()

but html_text() returns an empty list. I would appreciate any help. Thanks!

Upvotes: 0

Views: 156

Answers (2)

xwhitelight

Reputation: 1579

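The problem is that you are scraping the wrong URL. Everything after the # in your URL is a fragment, which is not sent to the server, so the page that read_html() downloads does not include the search results. Change the URL to https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA and change html_nodes to html_node; your code should then work.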
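
Both points can be checked offline with a minimal sketch. The HTML snippet below is made up for illustration (the real page markup may differ), and the variable names are hypothetical. It shows what remains of the original URL once the part after # is dropped, and how html_node() and html_nodes() behave when a selector does or does not match.

```r
library(rvest)

# The part after "#" is a URL fragment; it stays on the client side
# and is not included in the request sent to the server.
original  <- "https://www.newspapers.com/search/#query=republican&dr_year=1929-1929&p_place=CA"
requested <- sub("#.*$", "", original)
print(requested)

# Made-up markup standing in for a results page (an assumption,
# not the real page source).
page <- read_html('<div><span class="total-hits">23,490</span></div>')

# html_node() returns the first match, so html_text() gives one string.
one <- page %>% html_node(".total-hits") %>% html_text()

# html_nodes() with a selector that matches nothing gives an empty result,
# which is what the question's code runs into.
many <- page %>% html_nodes(".missing") %>% html_text()

print(one)
print(length(many))
```

If the selector is present in the downloaded page, the same html_node() call on the real URL should return a single string that can then be converted to a number.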

Upvotes: 1

jazzurro

Reputation: 23574

Here is one way. Looking at the page source, it seems that you want to target the td elements. Then do some string manipulation and create the output. The first 10 rows are shown below.

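library(rvest)
library(dplyr)

# Collect the text of every table cell, strip newlines,
# then fold the vector into two columns (state, count)
read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
  html_nodes("td") %>%
  html_text() %>%
  gsub(pattern = "\\n", replacement = "") %>%
  matrix(ncol = 2, byrow = TRUE) %>%
  as.data.frame() %>%
  rename(state = V1, count = V2)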

                  state  count
1            California 23,490
2          Pennsylvania 51,697
3              New York 35,428
4               Indiana 23,199
5            New Jersey 22,787
6              Missouri 20,650
7                  Ohio 15,270
8              Illinois 14,920
9                  Iowa 14,676
10            Wisconsin 13,821
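
The reshaping step can be tried without a network connection. This is a small sketch in base R that assumes the td cells arrive as one flat character vector alternating between state and count; the two rows are taken from the output above, while the trailing newlines and the variable names are assumptions for illustration.

```r
# Flat vector of cell texts, alternating state / count (newlines are assumed)
cells <- c("California\n", "23,490", "Pennsylvania\n", "51,697")

# Strip newlines from every cell
cleaned <- gsub(pattern = "\\n", replacement = "", x = cells)

# byrow = TRUE fills the matrix row by row, so each pair of
# consecutive cells becomes one row with two columns
df <- as.data.frame(matrix(cleaned, ncol = 2, byrow = TRUE),
                    stringsAsFactors = FALSE)
names(df) <- c("state", "count")

# Drop the thousands separators and convert the counts to numeric
df$count <- as.numeric(gsub(",", "", df$count))

print(df)
```

Without byrow = TRUE the matrix would be filled column by column, which would mix states and counts in the same column.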

Another way is the following. Here I target the two kinds of cells more specifically, with the selectors td.tn and td.tar. Because there are two targets, I used map_dfc(), which builds a data frame directly. I then cleaned the strings as before and, this time, converted the counts from character to numeric.

library(rvest)
library(purrr)
library(dplyr)

# Download the page once and reuse it for both selectors
page <- read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA")

map_dfc(.x = c("td.tn", "td.tar"),
        .f = function(x) {
          page %>%
            html_nodes(x) %>%
            html_text()
        }) %>%
  rename(state = `...1`, count = `...2`) %>%
  mutate(state = gsub(x = state, pattern = "\\n", replacement = ""),
         # gsub() removes every thousands separator, not only the first
         count = as.numeric(gsub(x = count, pattern = ",", replacement = "")))

   state        count
   <chr>        <dbl>
 1 California   23490
 2 Pennsylvania 51697
 3 New York     35428
 4 Indiana      23199
 5 New Jersey   22787
 6 Missouri     20650
 7 Ohio         15270
 8 Illinois     14920
 9 Iowa         14676
10 Wisconsin    13821

Upvotes: 2
