geoscience123
geoscience123

Reputation: 230

How to count the number of matches in a tibble list column to another data frame in R?

I have two sets of data:

The first data frame (small) is relatively smaller than the second data frame (large). Each data frame has an id column with unique identifiers. The smaller data frame has a list column called links, which contains a list of links to the larger second data frame. The larger data frame has a column of attributes, we'll call att:

library(tidyverse)

a <- c(3, 3, NA, 5)
b <- c(NA, 3, 4, 5)

small <- tibble(id = c(1, 2),
                links = list(a, b))

large <- tibble(id = c(3, 4, 5),
                att = c("yes", "no", "maybe"))

My goal is to count the number of times each observation in the small data frame has links to observations with the "yes" attribute in the large data frame.

I feel like something like this is on the right track, but it isn't quite counting correctly:

counted <- small %>%
  mutate(count_yes = map_int(links, ~ sum(large$att[large$id %in% .x] == "yes")))

print(counted)
#> # A tibble: 2 × 3
#>      id links     count_yes
#>   <dbl> <list>        <int>
#> 1     1 <dbl [4]>         1
#> 2     2 <dbl [4]>         1

Here, count_yes appears as only 1, when it should read as a 2 and a 1.

Upvotes: 4

Views: 49

Answers (2)

SamR
SamR

Reputation: 20494

As you're looking for a solution, I think an expressive way here is to tidyr::unnest() the list column here, then left_join() to large and summarise():

small |>
    tidyr::unnest(links) |>
    left_join(large, by = c("links" = "id")) |>
    summarise(
        links = list(links),
        count_yes = sum(att == "yes", na.rm = TRUE), .by = id
    )

# # A tibble: 2 × 3
#      id links     count_yes
#   <dbl> <list>        <int>
# 1     1 <dbl [4]>         2
# 2     2 <dbl [4]>         1

Though I'd rather just keep the data in long form rather than doing the last step, unless there's a very good reason to work with list columns, as this will avoid the need to iterate over a column with map*() or *apply() functions.

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389205

You are on the right track but need some adjustment.

small %>%
  mutate(count_yes = map_int(links, ~sum(.x %in% large$id[large$att %in% "yes"])))

#     id links     count_yes
#  <dbl> <list>        <int>
#1     1 <dbl [4]>         2
#2     2 <dbl [4]>         1

Or in base R :

sapply(small$links, \(x) sum(x %in% large$id[large$att %in% "yes"]))

Note the use of %in% instead of == would return FALSE for NA values.

Upvotes: 3

Related Questions