Reputation: 15
A member of the community helped me write the following code:
library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)
However, when I try to run it, I'm receiving the following error:
! Tibble columns must have compatible sizes.
• Size 20: Existing data.
• Size 21: Column `author`.
ℹ Only values of size one are recycled.
I've searched solutions here, but haven't been able to make much sense out of them. I would appreciate any help.
Upvotes: 1
Views: 4269
Reputation: 4243
Since the author column in your code returns a list of all the authors on the page, and some articles have more than one author, the function is returning more authors than articles. A data frame or tibble must have the same number of elements in each column.
For example, this throws a similar error:
tibble::tibble(url = 1:3, author = 1:4)
#> Error: Tibble columns must have compatible sizes.
#> * Size 3: Existing data.
#> * Size 4: Column `author`.
#> i Only values of size one are recycled.
One option is to push the retrieval of the author name to the next step, when you read the content of each article. Note that the 10th URL links to a video with no article body, so it returns no content.
library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

# df <- map_dfr(seq(0, 200, by = 20), get_articles)
df <- map_dfr(0, get_articles) # small example

df %>%
  slice(1:10) %>% # subset 10 rows for example
  mutate(html = map(url, read_html),
         content = map(html, ~ .x %>%
                         html_elements(".bOfvBY") %>%
                         html_text2() %>%
                         paste(collapse = ",")),
         author = map(html, ~ .x %>%
                        html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
                        html_text2() %>%
                        # name the elements, which will become column names
                        set_names(paste0("author", 1:length(.))))
  ) %>%
  unnest(content) %>%
  unnest_wider(author)
#> # A tibble: 10 x 7
#> title date url html content author1 author2
#> <chr> <chr> <chr> <lis> <chr> <chr> <chr>
#> 1 "US Soldier S~ Today ~ https://www.t~ <xml~ "A US soldier ~ Kalyn W~ <NA>
#> 2 "South Caroli~ Yester~ https://www.t~ <xml~ "On Tuesday, a~ Jessica~ <NA>
#> 3 "Abortion is ~ Tuesda~ https://www.t~ <xml~ "Abortion is o~ Jessica~ <NA>
#> 4 "Pennsylvania~ 9/02/2~ https://www.t~ <xml~ "Pennsylvania ~ Kalyn W~ <NA>
#> 5 "UN Committee~ 9/02/2~ https://www.t~ <xml~ "The devolving~ Jessica~ <NA>
#> 6 "DA Fani Will~ 8/30/2~ https://www.t~ <xml~ "There continu~ Murjani~ <NA>
#> 7 "How to Prote~ 8/30/2~ https://www.t~ <xml~ "The decision ~ Jessica~ <NA>
#> 8 "26 Alleged G~ 8/29/2~ https://www.t~ <xml~ "Twenty-six pe~ Keith R~ <NA>
#> 9 "Judge Angere~ 8/29/2~ https://www.t~ <xml~ "Sullivan Walt~ Kalyn W~ <NA>
#> 10 "Small Town H~ 8/27/2~ https://www.t~ <xml~ "" Kalyn W~ Adriano~
Created on 2022-09-08 by the reprex package (v2.0.0)
Upvotes: 3
Reputation: 7288
The core problem is that you're scraping a web page, so your data is unpredictable.
Specifically, a tibble is a fancy table. When you construct it, all the columns have to be the same length, which is common sense. Either the web page you're consuming, or the way you're processing it, is producing columns of different lengths.
For instance, when I run this with n_articles as 2, I get 20 titles, 21 authors, and 19 dates.
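A quick way to see this kind of mismatch for yourself is to count the matched nodes for each selector before building the tibble. This is a sketch using the selectors from the question; the counts will vary as the site's content changes:

```r
library(rvest)

# Read one listing page and count what each selector matches
page <- read_html("https://www.theroot.com/news/criminal-justice?startIndex=40")
c(
  titles  = length(html_elements(page, ".aoiLP .js_link")),
  authors = length(html_elements(page, ".llHfhX .js_link , .permalink-bylineprop")),
  dates   = length(html_elements(page, ".js_meta-time"))
)
```

If the three counts differ, tibble() will refuse to combine them, which is exactly the error you saw.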
What you're doing won't work unless either the page reliably yields the same number of elements for every field, or your processing tolerates the mismatch. Not sure which is more appropriate in your case, but I recommend writing a function that processes a single article and returns a row-like object for each. That way you can define the function so that it always returns something for each property of an article, and you won't have these missing data points.
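As a sketch of that per-article approach (the selector is the one from the question, and collapsing multiple authors into one string is an assumption about what you want, not the only option):

```r
library(rvest)
library(tidyverse)

# Parse one article URL into a one-row tibble.
# Every field is forced to length one, so binding rows always works.
parse_article <- function(url) {
  page <- read_html(url)
  authors <- page %>%
    html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
    html_text2()
  tibble(
    url = url,
    # collapse multiple authors; use NA when none are found
    author = if (length(authors) > 0) paste(authors, collapse = "; ") else NA_character_
  )
}

# Usage: df <- map_dfr(article_urls, parse_article)
```

Because each call returns exactly one row, a missing or multi-valued author can never change the shape of the result.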
Upvotes: 1