Reputation: 77
I'm working with a very large dataset that features a column with a variable-length list of terms that apply to each observation. Unfortunately, the column is character type and expresses the list as a string, complete with brackets, as well as quotations around and commas between the individual terms. In the code block below I have presented similar, example data frames. What I'd like to do is basically go from raw_df
to desired_df
.
library(tidyverse)
raw_df <- data.frame("movie_id" = c(1, 2, 3, 4),
"categories" = c("[\"Romance\", \"Comedy\", \"Holiday\"]",
"[\"Romance\", \"Comedy\"]",
"[\"Horror\"]",
"[\"Action\", \"Thriller\"]") )
desired_df <- data.frame("movie_id" = c(1, 1, 1, 2, 2, 3, 4, 4),
"categories" = c("Romance", "Comedy", "Holiday",
"Romance", "Comedy", "Horror",
"Action", "Thriller"))
What I am trying and failing to do is come up with an elegant, vectorized solution to transform the character string into a list, and then from the list extract each term (I think ideally in a long data frame format) for further analysis.
I wrote some non-vectorized functions to do this no problem within the context of a for loop, but the dataset is sufficiently large for that to take prohibitively long to run. For example:
unzip_terms <- function(t, x){
y <- as.list(gsub("]", "", gsub("[", "", gsub("'", "", strsplit(x, ","), fixed=TRUE), fixed=TRUE), fixed = TRUE))
df <- data.frame(id = t, term = y) %>%
group_by(id, term) %>%
summarize(count = n(), .groups = "drop")
return(df)
}
compile_term_df <- function(df){
for(i in 1:nrow(df)){
working_df <- unzip_terms(df$id[i], df$term[i])
if(i == 1){
final_df <- working_df
} else {
final_df <- final_df %>% bind_rows(working_df)
}
}
return(final_df)
}
I have additionally tried different combinations of strsplit
, lapply
, and unnest
, and no logic utilizing those has worked for me yet. Appreciate any direction anyone may be able to provide.
Upvotes: 1
Views: 49
Reputation: 886938
We can use separate_rows
after removing the square brackets and the quotes
library(dplyr)
library(tidyr)
library(stringr)
raw_df %>%
mutate(categories = str_remove_all(categories, '\\[|\\]|"')) %>%
separate_rows(categories)
-output
# A tibble: 8 x 2
# movie_id categories
# <dbl> <chr>
#1 1 Romance
#2 1 Comedy
#3 1 Holiday
#4 2 Romance
#5 2 Comedy
#6 3 Horror
#7 4 Action
#8 4 Thriller
Upvotes: 2