Reputation: 3
I am scraping the number of newspaper articles containing certain words. For example, the word "Republican" in CA in 1929, from this website:
url <- "https://www.newspapers.com/search/#query=republican&dr_year=1929-1929&p_place=CA"
I want to copy the number of hits (in the example, 23490), and I am using this code:
library(rvest)

hits <- url %>%
  read_html() %>%
  html_nodes('.total-hits') %>%
  html_text()
but html_text() returns an empty result (character(0)). I would appreciate any help. Thanks!
Upvotes: 0
Views: 156
Reputation: 1579
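The problem is that you are scraping the wrong URL. Everything after the # is a URL fragment, which is never sent to the server, so read_html() does not receive the search results. Change the URL to https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA and change html_nodes to html_node; then your code should work.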
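To illustrate the difference between html_nodes and html_node, here is a minimal sketch that runs offline. The inline HTML is hypothetical markup standing in for the real results page, and .no-such-class is an invented selector used only to show what happens when nothing matches.

```r
library(rvest)  # provides read_html(), html_node(), html_nodes(), html_text() and %>%

# Hypothetical markup; the structure of the real page is an assumption
page <- read_html('<html><body><span class="total-hits">23,490</span></body></html>')

# html_nodes() returns every match as a node set; html_node() returns the first match
all_hits  <- page %>% html_nodes(".total-hits") %>% html_text()
first_hit <- page %>% html_node(".total-hits") %>% html_text()

# A selector that matches nothing behaves differently in the two functions
missing_all   <- page %>% html_nodes(".no-such-class") %>% html_text()
missing_first <- page %>% html_node(".no-such-class") %>% html_text()

print(all_hits)
print(first_hit)
print(length(missing_all))   # empty character vector, as in the question
print(is.na(missing_first))  # html_node() gives NA instead of an empty result
```

In rvest 1.0.0 and later these functions are also available as html_elements() and html_element(); the older names used here still work.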
Upvotes: 1
Reputation: 23574
Here is one way. Looking at the page source, it seems that you want to target the td elements. Then do some string manipulation and create the output. The first 10 rows are shown below.
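library(rvest)
library(dplyr)

read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
  html_nodes("td") %>%                          # every table cell: state, count, state, count, ...
  html_text() %>%
  gsub(pattern = "\\n", replacement = "") %>%   # strip newlines from the cell text
  matrix(ncol = 2, byrow = TRUE) %>%            # pair consecutive cells into rows
  as.data.frame() %>%
  rename(state = V1, count = V2)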
          state  count
1    California 23,490
2  Pennsylvania 51,697
3      New York 35,428
4       Indiana 23,199
5    New Jersey 22,787
6      Missouri 20,650
7          Ohio 15,270
8      Illinois 14,920
9          Iowa 14,676
10    Wisconsin 13,821
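The reshaping step can be tried without hitting the website. This is a small base-R sketch: the cells vector reuses the first three rows of the output, and the trailing newlines are an assumption about what html_text() returns for the state cells.

```r
# Cell texts in the order html_text() would return them:
# state, count, state, count, ... (trailing "\n" is assumed for illustration)
cells <- c("California\n", "23,490",
           "Pennsylvania\n", "51,697",
           "New York\n", "35,428")

# Remove the newlines, as in the pipeline
cleaned <- gsub(pattern = "\\n", replacement = "", x = cells)

# byrow = TRUE fills the matrix row by row, so each consecutive
# pair of cells becomes one (state, count) row
m <- matrix(cleaned, ncol = 2, byrow = TRUE)

df <- as.data.frame(m, stringsAsFactors = FALSE)
names(df) <- c("state", "count")
print(df)
```

Note that this only works when the number of cells is even and the cells strictly alternate; any extra td on the page would shift the pairing.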
Another way is the following. Here I specify more precisely which cells to read. There are two targets, td.tn (state names) and td.tar (counts), so I use map_dfc() to build a data frame directly, with one column per selector. Then I clean the strings as before and, this time, convert the counts from character to numeric.
library(rvest)
library(purrr)
library(dplyr)

map_dfc(.x = c("td.tn", "td.tar"),
        .f = function(x){
          read_html("https://go.newspapers.com/results.php?query=republican&dr_year=1929-1929&p_place=CA") %>%
            html_nodes(x) %>%
            html_text()}
        ) %>%
  rename(state = `...1`, count = `...2`) %>%
  mutate(state = gsub(x = state, pattern = "\\n", replacement = ""),
         # gsub() rather than sub(), so counts with more than one comma also convert
         count = as.numeric(gsub(x = count, pattern = ",", replacement = "")))
   state        count
   <chr>        <dbl>
 1 California   23490
 2 Pennsylvania 51697
 3 New York     35428
 4 Indiana      23199
 5 New Jersey   22787
 6 Missouri     20650
 7 Ohio         15270
 8 Illinois     14920
 9 Iowa         14676
10 Wisconsin    13821
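One detail worth checking when converting the counts to numeric is how the thousands separators are removed. The sketch below compares sub() and gsub() in base R; the value "1,234,567" is hypothetical and is there only to show a count with more than one comma.

```r
# "23,490" comes from the output above; "1,234,567" is a made-up larger count
counts <- c("23,490", "1,234,567")

# sub() replaces only the first match in each string
first_only <- sub(pattern = ",", replacement = "", x = counts)

# gsub() replaces every match
all_commas <- gsub(pattern = ",", replacement = "", x = counts)

# A leftover comma makes as.numeric() return NA (with a warning)
num_first_only <- suppressWarnings(as.numeric(first_only))
num_all_commas <- as.numeric(all_commas)

print(first_only)
print(all_commas)
print(num_first_only)
print(num_all_commas)
```

For counts below one million both functions give the same result, so the difference only matters for larger totals. If readr is available, parse_number() is another option for this kind of cleanup.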
Upvotes: 2