Unnesting List Expressed As String In Data Frame Column In R

Question

I'm working with a very large dataset that features a column with a variable-length list of terms that apply to each observation. Unfortunately, the column is character type and expresses the list as a string, complete with brackets, as well as quotations around and commas between the individual terms. In the code block below I have presented similar, example data frames. What I'd like to do is basically go from raw_df to desired_df.

library(tidyverse)


raw_df <- data.frame("movie_id" = c(1, 2, 3, 4),
                        "categories" = c("[\"Romance\", \"Comedy\", \"Holiday\"]", 
                                         "[\"Romance\", \"Comedy\"]", 
                                         "[\"Horror\"]",
                                         "[\"Action\", \"Thriller\"]") ) 

desired_df <- data.frame("movie_id" = c(1, 1, 1, 2, 2, 3, 4, 4),
                         "categories" = c("Romance", "Comedy", "Holiday",
                                          "Romance", "Comedy", "Horror", 
                                          "Action", "Thriller"))

What I am trying and failing to do is come up with an elegant, vectorized solution to transform the character string into a list, and then from the list extract each term (I think ideally in a long data frame format) for further analysis.

I wrote some non-vectorized functions to do this no problem within the context of a for loop, but the dataset is sufficiently large for that to take prohibitively long to run. For example:

unzip_terms <- function(t, x){
    y <- as.list(gsub("]", "", gsub("[", "", gsub("'", "", strsplit(x, ","), fixed=TRUE), fixed=TRUE), fixed = TRUE))
    df <- data.frame(id = t, term = y) %>%
        group_by(id, term) %>%
        summarize(count = n(), .groups = "drop")
    return(df)
}


compile_term_df <- function(df){
    for(i in 1:nrow(df)){
        working_df <- unzip_terms(df$id[i], df$term[i])
        if(i == 1){
            final_df <- working_df 
        } else {
            final_df <- final_df %>% bind_rows(working_df)
        }
    }
    return(final_df)
}

I have additionally tried different combinations of strsplit, lapply, and unnest, and no logic utilizing those has worked for me yet. Appreciate any direction anyone may be able to provide.

akrun · Accepted Answer

We can use separate_rows after removing the square brackets and the quotes

library(dplyr)
library(tidyr)
library(stringr)
raw_df %>% 
  mutate(categories = str_remove_all(categories, '$$|$$|"')) %>% 
  separate_rows(categories)

-output

# A tibble: 8 x 2
#  movie_id categories
#           
#1        1 Romance   
#2        1 Comedy    
#3        1 Holiday   
#4        2 Romance   
#5        2 Comedy    
#6        3 Horror    
#7        4 Action    
#8        4 Thriller

Unnesting List Expressed As String In Data Frame Column In R

Answers (1)

Related Questions