Reputation: 137
I have two data frames, both containing lists of numbers within cells. I'm trying populate columns in df_1 with the number of times an item from ITEM_ID_LIST within df_1 appears in corresponding column within df_2.
Here is the first data.frame:
ID ITEM_ID_LIST COL_1 COL_2
3001 c(2008, 2011, 2006, 2007) NA NA
3002 c(2012, 2011, 2003, 2004) NA NA
Here is the second data.frame:
COL_1 COL_2
c(2004, 2010, 2011) c(2011)
And, this is what I am trying to get:
ID ITEM_ID_LIST COL_1 COL_2
3001 c(2008, 2011, 2006, 2007) 1 1
3002 c(2012, 2011, 2003, 2004) 2 1
So, for example, for the first row, only 2011 from ITEM_ID_LIST occurs within COL_1 and COL_2 of df_2. And, for row two, 2004 and 2011 both occur in COL_1 of df_2 but only 2011 occurs in COL_2 of df_2.
The actual data has a lot more rows and columns, so I am trying to avoid looping over individual cell. And, I have them set up as data.frame, because I thought that might be easier. But a solution involving matrices would also work.
Here is the dput:
df_1 <- structure(list(ID = 3001:3002,
ITEM_ID_LIST = list(c(2008L, 2011L, 2006L, 2007L),
c(2012L, 2011L, 2003L, 2004L)),
COL_1 = c(NA, NA),
COL_2 = c(NA, NA)),
row.names = c(NA, -2L),
class = "data.frame")
df_2 <- structure(list(COL_1 = list(c(2004L, 2010L, 2011L)),
COL_2 = list(2011L)),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA, -1L))
Upvotes: 0
Views: 41
Reputation: 2143
I hope this will address your concerns about having many columns; this should work no matter how many you have.
I unnest
ed both df.
long1 <- df_1 %>%
select(ID, ITEM_ID_LIST) %>%
unnest() %>%
arrange(ID, ITEM_ID_LIST)
ID ITEM_ID_LIST
1 3001 2006
2 3001 2007
3 3001 2008
4 3001 2011
5 3002 2003
6 3002 2004
7 3002 2011
8 3002 2012
long2 <- df_2 %>%
gather(key = "COL", value = "LVAL", starts_with("COL")) %>%
unnest()
# A tibble: 4 x 2
COL LVAL
<chr> <int>
1 COL_1 2004
2 COL_1 2010
3 COL_1 2011
4 COL_2 2011
I inner_join
ed them
long1 %>%
inner_join(long2, by = c(ITEM_ID_LIST = "LVAL")) %>%
group_by(ID, COL) %>%
# Count how many years they had in common.
summarise(COL.COUNT = length(ITEM_ID_LIST)) %>%
# Now back to your wide format
spread(key = COL, value = COL.COUNT) %>%
ungroup
And now we have:
# A tibble: 2 x 3
ID COL_1 COL_2
<int> <int> <int>
1 3001 1 1
2 3002 2 1
Upvotes: 1
Reputation: 887851
We can use map
library(purrr)
library(dplyr)
df_1 %>%
mutate(COL_1 = map_int(ITEM_ID_LIST, ~ sum(.x %in% df_2$COL_1[[1]])),
COL_2 = map_int(ITEM_ID_LIST, ~ sum(.x %in% df_2$COL_2[[1]])))
# ID ITEM_ID_LIST COL_1 COL_2
#1 3001 2008, 2011, 2006, 2007 1 1
#2 3002 2012, 2011, 2003, 2004 2 1
Or we unnest
the second dataset, loop over the columns
library(tidyr)
df_2 %>%
unnest(everything()) %>%
summarise_all(~ {x1 <- .
map(df_1$ITEM_ID_LIST, ~ sum(.x %in% x1))}) %>%
unnest(everything()) %>%
bind_cols(df_1 %>%
select(ID, ITEM_ID_LIST), .)
# A tibble: 2 x 4
# ID ITEM_ID_LIST COL_1 COL_2
# <int> <list> <int> <int>
#1 3001 <int [4]> 1 1
#2 3002 <int [4]> 2 1
Or using a simple for
loop
for(nm in names(df_2)) df_1 <- df_1 %>%
mutate(!! nm := map_int(ITEM_ID_LIST, ~ sum(.x %in% df_2[[nm]][[1]])))
Upvotes: 0