Reputation: 230
I have two sets of data:
The first data frame (small
) is relatively smaller than the second data frame (large
). Each data frame has an id
column with unique identifiers. The smaller data frame has a list column called links
, which contains a list of links to the larger second data frame. The larger data frame has a column of attributes, we'll call att
:
library(tidyverse)
a <- c(3, 3, NA, 5)
b <- c(NA, 3, 4, 5)
small <- tibble(id = c(1, 2),
links = list(a, b))
large <- tibble(id = c(3, 4, 5),
att = c("yes", "no", "maybe"))
My goal is to count the number of times each observation in the small
data frame has links to observations with the "yes"
attribute in the large
data frame.
I feel like something like this is on the right track, but it isn't quite counting correctly:
counted <- small %>%
mutate(count_yes = map_int(links, ~ sum(large$att[large$id %in% .x] == "yes")))
print(counted)
#> # A tibble: 2 × 3
#> id links count_yes
#> <dbl> <list> <int>
#> 1 1 <dbl [4]> 1
#> 2 2 <dbl [4]> 1
Here, count_yes
appears as only 1, when it should read as a 2 and a 1.
Upvotes: 4
Views: 49
Reputation: 20494
As you're looking for a tidyverse solution, I think an expressive way here is to tidyr::unnest()
the list column here, then left_join()
to large
and summarise()
:
small |>
tidyr::unnest(links) |>
left_join(large, by = c("links" = "id")) |>
summarise(
links = list(links),
count_yes = sum(att == "yes", na.rm = TRUE), .by = id
)
# # A tibble: 2 × 3
# id links count_yes
# <dbl> <list> <int>
# 1 1 <dbl [4]> 2
# 2 2 <dbl [4]> 1
Though I'd rather just keep the data in long form rather than doing the last step, unless there's a very good reason to work with list columns, as this will avoid the need to iterate over a column with map*()
or *apply()
functions.
Upvotes: 1
Reputation: 389205
You are on the right track but need some adjustment.
small %>%
mutate(count_yes = map_int(links, ~sum(.x %in% large$id[large$att %in% "yes"])))
# id links count_yes
# <dbl> <list> <int>
#1 1 <dbl [4]> 2
#2 2 <dbl [4]> 1
Or in base R :
sapply(small$links, \(x) sum(x %in% large$id[large$att %in% "yes"]))
Note the use of %in%
instead of ==
would return FALSE
for NA
values.
Upvotes: 3