mmyoung77

Reputation: 1417

How to associate a list of character vectors with your data frame in R

The shape of my data is fairly simple:

set.seed(1337)
id <- c(1:4)
values <- runif(n = 4, min = 0, max = 1)
df <- data.frame(id, values) 
df
  id     values
1  1 0.57632155
2  2 0.56474213
3  3 0.07399023
4  4 0.45386562

What isn't simple: I have a list of character vectors, one per row, where each list item can be empty or can contain up to 5 separate tags, like...

tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)

I will be asked various questions using the tags as classifiers, for instance, "What is the average value of all rows with a B tag?" or "How many rows contain both tag A and tag C?"

What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.

Upvotes: 1

Views: 342

Answers (2)

akrun

Reputation: 887691

Here is one option to get the expected output. Add 'tags' to the dataset as a list column and unnest it (as already suggested in the comments). Then summarise: the sum of a logical vector counts the tag entries that are 'A' or 'C', and the mean of 'values' where 'tag' is 'B' gives the average for that tag.

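library(tidyverse)
df %>%
  mutate(tag = tags) %>%   # store the list as a list column
  unnest(tag) %>%          # one row per (id, tag) pair; name the column explicitly
  summarise(nAC = sum(tag %in% c("A", "C")),   # number of tag entries equal to A or C
            meanB = mean(values[tag == "B"], na.rm = TRUE))   # mean value over B-tagged rows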
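Note that `sum(tag %in% c("A", "C"))` counts individual tag entries, not rows that carry both tags. As a minimal sketch of the "both A and C" question, assuming the tags stay in a list column and only base R is available, a membership test per row could look like this. The helper `has_all` is a hypothetical name, not part of any package, and the `values` are hard-coded from the output printed in the question.

```r
# Rebuild the question's example data (values copied from the printed output)
df <- data.frame(
  id = 1:4,
  values = c(0.57632155, 0.56474213, 0.07399023, 0.45386562)
)
df$tags <- list("A", NA, c("A", "B", "C"), c("B", "C"))

# Hypothetical helper: TRUE for each row whose tag vector contains
# every tag in `wanted`; an NA entry simply matches nothing
has_all <- function(tag_list, wanted) {
  vapply(tag_list, function(x) all(wanted %in% x), logical(1))
}

n_both_AC <- sum(has_all(df$tags, c("A", "C")))    # rows with both A and C: 1
mean_B    <- mean(df$values[has_all(df$tags, "B")]) # mean value over rows tagged B
```

Because nothing is unnested, the data frame keeps one row per id, which may matter when the real file is large.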

Upvotes: 1

BENY

Reputation: 323356

This is not very hard. Assign your list to the data frame as a new column named tags, then unnest it. Solutions to the two questions you listed are below.

library(tidyr)
library(dplyr)

df$tags <- list(
  c("A"),
  NA,
  c("A", "B", "C"),
  c("B", "C")
)
Newdf <- df %>% tidyr::unnest(tags)

Q1.

Newdf %>% group_by(tags) %>% summarise(Mean = mean(values)) %>% filter(tags == 'B')
   tags              Mean
  <chr>             <dbl>
1     B 0.263927925960161

Q2.

Newdf %>% group_by(id) %>% dplyr::summarise(Count = any(tags == 'A') & any(tags == 'C'))
# A tibble: 4 x 2
     id Count
  <int> <lgl>
1     1 FALSE
2     2    NA
3     3  TRUE
4     4 FALSE
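The Q2 result is one logical per id, with NA for the untagged row, so it still has to be collapsed into a single count. As an alternative sketch (not taken from either answer), the tags could be stored as a logical indicator matrix with one column per tag, which turns both questions into plain vector operations. It assumes base R only; the names `all_tags` and `tag_matrix` are illustrative, and the `values` are hard-coded from the question's output.

```r
# Example data from the question (values copied from the printed output)
values <- c(0.57632155, 0.56474213, 0.07399023, 0.45386562)
tags <- list("A", NA, c("A", "B", "C"), c("B", "C"))

# Distinct tags, ignoring the NA placeholder used for untagged rows
all_tags <- unique(unlist(tags))
all_tags <- sort(all_tags[!is.na(all_tags)])

# One row per data row, one logical column per tag
tag_matrix <- t(vapply(tags, function(x) all_tags %in% x,
                       logical(length(all_tags))))
colnames(tag_matrix) <- all_tags

# "How many rows contain both tag A and tag C?"
n_both_AC <- sum(tag_matrix[, "A"] & tag_matrix[, "C"])   # 1

# "What is the average value of all rows with a B tag?"
mean_B <- mean(values[tag_matrix[, "B"]])
```

The matrix has as many rows as the original data and never contains NA, so an untagged row is just FALSE in every column. The transpose step assumes at least two distinct tags; with a single tag, `vapply` returns a plain vector and the reshaping would need adjusting.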

Upvotes: 1
