Jordan Lewis
Jordan Lewis

Reputation: 33

R: Creating new fields based on values in data frame list-columns

I flattened a json file using jsonlite and ended up with list-columns containing a key field, which in my example data below I call "Clothes":

df <- data.frame("ID" = c(1,2,3,4))
df$Things = list(list(Clothes = c("shirt","shoe","sock"), shapes = c("circle", "square")),
              list(Clothes = c("shirt","pant","jacket"), shapes = c("triangle", "circle")),
              list(Clothes = c("pant","belt"), shapes = c("pentagon", "square")),
              list(Clothes = c("shoe","scarf","sock"), shapes = c("circle", "pentagon")))

My goal is to pull out these values as new binary variables indicating whether each record contains each clothing items. I'd also like to pull out these clothing items into separate columns, even while the clothing lists sometimes have different lengths. As you can see, the list-column is two levels deep, with the Clothes list inside the Things list.

Here's what the sample output would look like:

dfOut <- mutate(df,belt = c(0,0,1,0),pant = c(0,1,1,0),shirt = c(1,1,0,0),
Clothes1 = c("shirt","shirt","pant","shoe"),
Clothes2 = c("shoe","pant","belt","scarf"),
Clothes3 = c("sock","jacket",NA,"sock"))

I assume the solutions would involve dplyr::mutate(), purrr::map(), apply(), or ifelse(). I'd also appreciate help with the right terms/concepts so I can better ask these types of questions in the future.

Upvotes: 2

Views: 113

Answers (3)

markdly
markdly

Reputation: 4534

To do this task, how about first creating a 'tidy' dataframe (see http://tidyr.tidyverse.org/ for a 'tidy data' definition):

library(dplyr)
library(tidyr)
library(purrr)

tidy_df <- df %>%
  mutate(Clothes = map(Things, "Clothes")) %>%
  unnest(Clothes)
tidy_df

#>    ID Clothes
#> 1   1   shirt
#> 2   1    shoe
#> 3   1    sock
#> 4   2   shirt
#> 5   2    pant
#> 6   2  jacket
#> 7   3    pant
#> 8   3    belt
#> 9   4    shoe
#> 10  4   scarf
#> 11  4    sock

From there you can make the different components for the required output with tidyr::spread

df1 <- tidy_df %>% 
  mutate(has_clothes = 1) %>%
  spread(Clothes, has_clothes, fill = 0)

df2 <- tidy_df %>% 
  group_by(ID) %>% 
  mutate(rownum = paste0("Clothes", row_number())) %>%
  spread(rownum, Clothes)

left_join(df1, df2)

#> Joining, by = "ID"
#>   ID belt jacket pant scarf shirt shoe sock Clothes1 Clothes2 Clothes3
#> 1  1    0      0    0     0     1    1    1    shirt     shoe     sock
#> 2  2    0      1    1     0     1    0    0    shirt     pant   jacket
#> 3  3    1      0    1     0     0    0    0     pant     belt     <NA>
#> 4  4    0      0    0     1     0    1    1     shoe    scarf     sock

That is, the desired output dfOut can be obtained by:

df %>% 
  left_join(df1, by = "ID") %>%
  left_join(df2, by = "ID")

Upvotes: 0

Glaud
Glaud

Reputation: 733

You can use simple double loop to achieve first part of your task.

for (n in c("shirt", "scarf", "sock", "belt", "jacket","pant")) {
  for (i in 1:dim(df)[1]) {
    df[[n]][i] <- ifelse(n %in% df$Things[[i]]$Clothes, 1, 0)
  }
}
df  

   ID      Things  shirt   scarf    sock    belt    jacket    pant   shoe
    1  1     shirt, shoe, sock, circle, square     1     0    1    0      0    0    1
    2  2 shirt, pant, jacket, triangle, circle     1     0    0    0      1    1    0
    3  3          pant, belt, pentagon, square     0     0    0    1      0    1    0
    4  4   shoe, scarf, sock, circle, pentagon     0     1    1    0      0    0    1

For the second part you could try something like that

Clothes <- unlist(df$Things)
Clothes <- data.frame(Name=attr(cl, "names"),Thing=cl)
for (j in 1:3) {
  assign( paste0("Clothes",j), 
           as.character( (Clothes %>% filter(Name == paste0("Clothes",j))) [,2]) )
}
Clothes2
[1] "shoe"  "pant"  "belt"  "scarf"

But it does not give NAs so it's not quite what you wanted.

Upvotes: 0

Nate
Nate

Reputation: 10671

We can do something like this to count up all of the clothes that occur in the df$Things list:

library(tidyverse)

# keep only Clothes, drop Shapes, and unlist for ease
df$Things <- purrr::map(df$Things, ~ .[1] %>% unlist)

# build a self-named vector of clothes types, for colnames from map_dfc()
all_clothes <- unique(unlist(df$Things)) %>% set_names(.)

# count occurances with grepl() and convert from bool to num
counts <- purrr::map_dfc(all_clothes, ~ as.numeric(grepl(., df$Things)))

# bolt it on
dplyr::bind_cols(df, counts)

  ID              Things shirt shoe sock pant jacket belt scarf
1  1   shirt, shoe, sock     1    1    1    0      0    0     0
2  2 shirt, pant, jacket     1    0    0    1      1    0     0
3  3          pant, belt     0    0    0    1      0    1     0
4  4   shoe, scarf, sock     0    1    1    0      0    0     1

Upvotes: 1

Related Questions