Joshua Crutchfield

Reputation: 15

How do I get rid of the error: Tibble columns must have compatible sizes?

A member of the community helped me write the following code:

library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)

However, when I run it, I get the following error:

! Tibble columns must have compatible sizes.
• Size 20: Existing data.
• Size 21: Column author.
ℹ Only values of size one are recycled.

I've searched for solutions here, but haven't been able to make much sense of them. I would appreciate any help.

Upvotes: 1

Views: 4269

Answers (2)

nniloc

Reputation: 4243

In your code, the author selector returns every author found on the listing page. Because some articles have more than one author, it returns more authors than there are articles (21 versus 20 in your error). Every column of a data frame or tibble must have the same number of elements.

For example, this throws a similar error:

tibble::tibble(url = 1:3, author = 1:4)
#> Error: Tibble columns must have compatible sizes.
#> * Size 3: Existing data.
#> * Size 4: Column `author`.
#> i Only values of size one are recycled.

One option is to defer retrieving the author names to the next step, when you read the content of each article. Note that the 10th URL links to a video with no article body, so its content comes back empty.

library(rvest)
library(tidyverse)


get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

#df <- map_dfr(seq(0, 200, by = 20), get_articles)
df <- map_dfr(0, get_articles) #small example


df %>%
  slice(1:10) %>% # subset 10 rows for example
  mutate(html = map(url, read_html),
         content = map(html, ~ .x %>%
                         html_elements(".bOfvBY") %>%
                         html_text2() %>%
                         paste(collapse = ",")),
         author = map(html, ~ .x %>%
                        html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
                        html_text2() %>%
                        set_names(paste0('author', 1:length(.))) #name the elements, which will become column names
                      )
         ) %>%
  unnest(content) %>%
  unnest_wider(author)
#> # A tibble: 10 x 7
#>    title          date    url            html  content         author1  author2 
#>    <chr>          <chr>   <chr>          <lis> <chr>           <chr>    <chr>   
#>  1 "US Soldier S~ Today ~ https://www.t~ <xml~ "A US soldier ~ Kalyn W~ <NA>    
#>  2 "South Caroli~ Yester~ https://www.t~ <xml~ "On Tuesday, a~ Jessica~ <NA>    
#>  3 "Abortion is ~ Tuesda~ https://www.t~ <xml~ "Abortion is o~ Jessica~ <NA>    
#>  4 "Pennsylvania~ 9/02/2~ https://www.t~ <xml~ "Pennsylvania ~ Kalyn W~ <NA>    
#>  5 "UN Committee~ 9/02/2~ https://www.t~ <xml~ "The devolving~ Jessica~ <NA>    
#>  6 "DA Fani Will~ 8/30/2~ https://www.t~ <xml~ "There continu~ Murjani~ <NA>    
#>  7 "How to Prote~ 8/30/2~ https://www.t~ <xml~ "The decision ~ Jessica~ <NA>    
#>  8 "26 Alleged G~ 8/29/2~ https://www.t~ <xml~ "Twenty-six pe~ Keith R~ <NA>    
#>  9 "Judge Angere~ 8/29/2~ https://www.t~ <xml~ "Sullivan Walt~ Kalyn W~ <NA>    
#> 10 "Small Town H~ 8/27/2~ https://www.t~ <xml~ ""              Kalyn W~ Adriano~

Created on 2022-09-08 by the reprex package (v2.0.0)

Upvotes: 3

Chris

Reputation: 7288

The core problem is that you're scraping a web page, so your data is unpredictable.

Specifically, a tibble is a table: when you construct it, every column must have the same length (only length-one values are recycled). Either the web page you're consuming or the way you're processing it is producing columns of different lengths.

For instance, when I run this with n_articles as 2, I get 20 titles, 21 authors, and 19 pages.
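One way to see this kind of mismatch is to count the matches for each selector before building the tibble. The sketch below runs on a small inline page, not the real site; the markup and the selectors `.title`, `.author`, and `time` are made up for illustration.

```r
library(rvest)

# Hypothetical markup standing in for a listing page (not theroot.com's real HTML).
page <- minimal_html('
  <article>
    <a class="title" href="/a1">First story</a>
    <span class="author">Ann</span><span class="author">Bob</span>
    <time>9/02/22</time>
  </article>
  <article>
    <a class="title" href="/a2">Second story</a>
    <span class="author">Cy</span>
    <time>8/30/22</time>
  </article>
')

# Hypothetical selectors, one per intended column.
selectors <- c(title = ".title", author = ".author", date = "time")

# Number of nodes each selector matches across the whole page.
counts <- sapply(selectors, function(css) length(html_elements(page, css)))
print(counts)  # title 2, author 3, date 2: the author column would not fit
```

Any selector whose count differs from the number of articles is the one that breaks the tibble() call.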

What you're doing won't work unless one of these changes:

  1. The source data (the html of the page) is cleaned up, so there's exactly one title, author, date, and url for each article, identified in a predictable fashion, OR
  2. You come up with some rules and/or default values so that you can fill in the gaps.

I'm not sure which is more appropriate in your case, but I recommend writing a function that processes a single article and returns one row-like object per article.

That way you can define the function such that it always returns something for each property of an article, and you won't have these missing data points.
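A minimal sketch of that idea, assuming each article sits in its own container element. The `<article>` wrapper, the class names, and the helper `parse_article()` are hypothetical; the real page's markup will differ, so the selectors would need adjusting.

```r
library(rvest)
library(dplyr)
library(tibble)

# Hypothetical markup: one container per article, with zero or more authors.
page <- minimal_html('
  <article>
    <a class="title" href="/a1">First story</a>
    <span class="author">Ann</span><span class="author">Bob</span>
    <time>9/02/22</time>
  </article>
  <article>
    <a class="title" href="/a2">Second story</a>
    <time>8/30/22</time>
  </article>
')

# Hypothetical helper: always returns exactly one row for one article node.
parse_article <- function(node) {
  authors <- node %>% html_elements(".author") %>% html_text2()
  tibble(
    title  = node %>% html_element(".title") %>% html_text2(),
    # Several authors are collapsed into one string; none becomes NA.
    author = if (length(authors) > 0) paste(authors, collapse = ", ") else NA_character_,
    date   = node %>% html_element("time") %>% html_text2(),
    url    = node %>% html_element(".title") %>% html_attr("href")
  )
}

# One row per container, so every column ends up the same length.
nodes <- html_elements(page, "article")
articles <- bind_rows(lapply(nodes, parse_article))
```

Because the author names are collected inside each container, an article with two authors or none still contributes a single row.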

Upvotes: 1
