Reputation: 33
I flattened a json file using jsonlite
and ended up with list-columns containing a key field, which in my example data below I call "Clothes":
df <- data.frame("ID" = c(1,2,3,4))
df$Things = list(list(Clothes = c("shirt","shoe","sock"), shapes = c("circle", "square")),
list(Clothes = c("shirt","pant","jacket"), shapes = c("triangle", "circle")),
list(Clothes = c("pant","belt"), shapes = c("pentagon", "square")),
list(Clothes = c("shoe","scarf","sock"), shapes = c("circle", "pentagon")))
My goal is to pull out these values as new binary variables indicating whether each record contains each clothing items. I'd also like to pull out these clothing items into separate columns, even while the clothing lists sometimes have different lengths. As you can see, the list-column is two levels deep, with the Clothes list inside the Things list.
Here's what the sample output would look like:
dfOut <- mutate(df,belt = c(0,0,1,0),pant = c(0,1,1,0),shirt = c(1,1,0,0),
Clothes1 = c("shirt","shirt","pant","shoe"),
Clothes2 = c("shoe","pant","belt","scarf"),
Clothes3 = c("sock","jacket",NA,"sock"))
I assume the solutions would involve dplyr::mutate()
, purrr::map()
, apply()
, or ifelse()
. I'd also appreciate help with the right terms/concepts so I can better ask these types of questions in the future.
Upvotes: 2
Views: 113
Reputation: 4534
To do this task, how about first creating a 'tidy' dataframe (see http://tidyr.tidyverse.org/ for a 'tidy data' definition):
library(dplyr)
library(tidyr)
library(purrr)
tidy_df <- df %>%
mutate(Clothes = map(Things, "Clothes")) %>%
unnest(Clothes)
tidy_df
#> ID Clothes
#> 1 1 shirt
#> 2 1 shoe
#> 3 1 sock
#> 4 2 shirt
#> 5 2 pant
#> 6 2 jacket
#> 7 3 pant
#> 8 3 belt
#> 9 4 shoe
#> 10 4 scarf
#> 11 4 sock
From there you can make the different components for the required output with tidyr::spread
df1 <- tidy_df %>%
mutate(has_clothes = 1) %>%
spread(Clothes, has_clothes, fill = 0)
df2 <- tidy_df %>%
group_by(ID) %>%
mutate(rownum = paste0("Clothes", row_number())) %>%
spread(rownum, Clothes)
left_join(df1, df2)
#> Joining, by = "ID"
#> ID belt jacket pant scarf shirt shoe sock Clothes1 Clothes2 Clothes3
#> 1 1 0 0 0 0 1 1 1 shirt shoe sock
#> 2 2 0 1 1 0 1 0 0 shirt pant jacket
#> 3 3 1 0 1 0 0 0 0 pant belt <NA>
#> 4 4 0 0 0 1 0 1 1 shoe scarf sock
That is, the desired output dfOut
can be obtained by:
df %>%
left_join(df1, by = "ID") %>%
left_join(df2, by = "ID")
Upvotes: 0
Reputation: 733
You can use simple double loop to achieve first part of your task.
for (n in c("shirt", "scarf", "sock", "belt", "jacket","pant")) {
for (i in 1:dim(df)[1]) {
df[[n]][i] <- ifelse(n %in% df$Things[[i]]$Clothes, 1, 0)
}
}
df
ID Things shirt scarf sock belt jacket pant shoe
1 1 shirt, shoe, sock, circle, square 1 0 1 0 0 0 1
2 2 shirt, pant, jacket, triangle, circle 1 0 0 0 1 1 0
3 3 pant, belt, pentagon, square 0 0 0 1 0 1 0
4 4 shoe, scarf, sock, circle, pentagon 0 1 1 0 0 0 1
For the second part you could try something like that
Clothes <- unlist(df$Things)
Clothes <- data.frame(Name=attr(cl, "names"),Thing=cl)
for (j in 1:3) {
assign( paste0("Clothes",j),
as.character( (Clothes %>% filter(Name == paste0("Clothes",j))) [,2]) )
}
Clothes2
[1] "shoe" "pant" "belt" "scarf"
But it does not give NAs so it's not quite what you wanted.
Upvotes: 0
Reputation: 10671
We can do something like this to count up all of the clothes that occur in the df$Things
list:
library(tidyverse)
# keep only Clothes, drop Shapes, and unlist for ease
df$Things <- purrr::map(df$Things, ~ .[1] %>% unlist)
# build a self-named vector of clothes types, for colnames from map_dfc()
all_clothes <- unique(unlist(df$Things)) %>% set_names(.)
# count occurances with grepl() and convert from bool to num
counts <- purrr::map_dfc(all_clothes, ~ as.numeric(grepl(., df$Things)))
# bolt it on
dplyr::bind_cols(df, counts)
ID Things shirt shoe sock pant jacket belt scarf
1 1 shirt, shoe, sock 1 1 1 0 0 0 0
2 2 shirt, pant, jacket 1 0 0 1 1 0 0
3 3 pant, belt 0 0 0 1 0 1 0
4 4 shoe, scarf, sock 0 1 1 0 0 0 1
Upvotes: 1