Reputation: 137

cross referencing list stored in column against lists stored in another data frame in R

I have two data frames, both containing lists of numbers within cells. I'm trying populate columns in df_1 with the number of times an item from ITEM_ID_LIST within df_1 appears in corresponding column within df_2.

Here is the first data.frame:

ID      ITEM_ID_LIST                COL_1      COL_2
3001    c(2008, 2011, 2006, 2007)   NA         NA
3002    c(2012, 2011, 2003, 2004)   NA         NA

Here is the second data.frame:

  COL_1                       COL_2
  c(2004, 2010, 2011)         c(2011)

And, this is what I am trying to get:

ID      ITEM_ID_LIST                COL_1      COL_2
3001    c(2008, 2011, 2006, 2007)   1          1
3002    c(2012, 2011, 2003, 2004)   2          1

So, for example, for the first row, only 2011 from ITEM_ID_LIST occurs within COL_1 and COL_2 of df_2. And, for row two, 2004 and 2011 both occur in COL_1 of df_2 but only 2011 occurs in COL_2 of df_2.

The actual data has a lot more rows and columns, so I am trying to avoid looping over individual cell. And, I have them set up as data.frame, because I thought that might be easier. But a solution involving matrices would also work.

Here is the dput:

df_1 <- structure(list(ID = 3001:3002, 
                       ITEM_ID_LIST = list(c(2008L, 2011L, 2006L, 2007L), 
                                           c(2012L, 2011L, 2003L, 2004L)), 
                       COL_1 = c(NA, NA), 
                       COL_2 = c(NA, NA)), 
                  row.names = c(NA, -2L), 
                  class = "data.frame")

df_2 <- structure(list(COL_1 = list(c(2004L, 2010L, 2011L)), 
                       COL_2 = list(2011L)), 
                  class = c("tbl_df", "tbl", "data.frame"), 
                  row.names = c(NA, -1L))

Upvotes: 0

Answers (2)

David T

Reputation: 2143

I hope this will address your concerns about having many columns; this should work no matter how many you have.

I unnested both df.

long1 <- df_1 %>%
  select(ID, ITEM_ID_LIST) %>% 
  unnest() %>% 
  arrange(ID, ITEM_ID_LIST)

    ID ITEM_ID_LIST
1 3001         2006
2 3001         2007
3 3001         2008
4 3001         2011
5 3002         2003
6 3002         2004
7 3002         2011
8 3002         2012

long2 <- df_2 %>% 
  gather(key = "COL", value = "LVAL", starts_with("COL")) %>% 
  unnest()

# A tibble: 4 x 2
  COL    LVAL
  <chr> <int>
1 COL_1  2004
2 COL_1  2010
3 COL_1  2011
4 COL_2  2011

I inner_joined them

long1 %>% 
  inner_join(long2, by = c(ITEM_ID_LIST = "LVAL")) %>% 
  group_by(ID, COL) %>% 
     # Count how many years they had in common.
  summarise(COL.COUNT = length(ITEM_ID_LIST)) %>% 
    # Now back to your wide format
  spread(key = COL, value = COL.COUNT) %>% 
  ungroup

And now we have:

# A tibble: 2 x 3
     ID COL_1 COL_2
  <int> <int> <int>
1  3001     1     1
2  3002     2     1

Upvotes: 1

akrun

Reputation: 887851

We can use map

library(purrr)
library(dplyr)
df_1 %>%
  mutate(COL_1 = map_int(ITEM_ID_LIST, ~  sum(.x %in% df_2$COL_1[[1]])),
         COL_2 = map_int(ITEM_ID_LIST, ~  sum(.x %in% df_2$COL_2[[1]])))
#   ID           ITEM_ID_LIST COL_1 COL_2
#1 3001 2008, 2011, 2006, 2007     1     1
#2 3002 2012, 2011, 2003, 2004     2     1

Or we unnest the second dataset, loop over the columns

library(tidyr)
df_2 %>%
   unnest(everything()) %>% 
   summarise_all(~ {x1 <- .
          map(df_1$ITEM_ID_LIST, ~ sum(.x %in% x1))}) %>%
   unnest(everything()) %>%
   bind_cols(df_1 %>% 
                select(ID, ITEM_ID_LIST), .)
# A tibble: 2 x 4
#     ID ITEM_ID_LIST COL_1 COL_2
#  <int> <list>       <int> <int>
#1  3001 <int [4]>        1     1
#2  3002 <int [4]>        2     1

Or using a simple for loop

for(nm in names(df_2))  df_1 <- df_1 %>% 
    mutate(!! nm := map_int(ITEM_ID_LIST, ~ sum(.x %in% df_2[[nm]][[1]])))

Upvotes: 0

cross referencing list stored in column against lists stored in another data frame in R

Answers (2)

Related Questions