Reputation: 1007

count values in one dataframe in another dataframe

Assume I have a dataframe like this:

df <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2, 3, 78, 124, 1)))
)

and another one like this:

clusters <- data.frame(
 g = I(list(c(124, 78, 1),c(2, 3, 89),c(532, 533)))
)

I want to count the times the elements in each id list are in the groups g1, g2, g3 (each row in clusters) when it's more than once. So basically count the intersection whenever the intersection > 1.

The output should be:

out <- data.frame(
  D = c('A', 'B', 'C'),
  Q = c('asd', 'reg', 'rt'),
  id = I(list(c(124, 532, 78), c(1, 3, 532), c(2,3,78, 124, 1))),
  count = c(2, 0, 5)
)

I cannot do it with for loop cause it will take forever as I have 70k rows.

A for loop would look like this:

for row in df:
  for group in clusters:
     if group intersect row$id > 1
        count=count+intersection
  count=0

Upvotes: 2

Answers (3)

M--

Reputation: 29153

Here's another tidyverse solution, but you're probably better off using base solutions provided for performance.

library(tidyverse)

df %>% 
  unnest(id) %>% 
  left_join({clusters %>% 
      rownames_to_column("r") %>% 
      unnest(g)}, by = c("id" = "g")) %>% 
  add_count(D, Q, r) %>% 
  summarise(id = list(id), n = first(n * (n > 1)), .by = c(D, Q, r)) %>% 
  summarise(id = list(unlist(id)), count = sum(n) , .by = c(D, Q)) %>% 
  as.data.frame()

#>   D   Q               id count
#> 1 A asd     124, 78, 532     2
#> 2 B reg        1, 3, 532     0
#> 3 C  rt 2, 3, 78, 124, 1     5

^{Created on 2024-01-22 with reprex v2.0.2}

Upvotes: 1

r2evans

Reputation: 160687

There's no way around some bespoke code.

sapply(df$id, function(id1) {
  lens <- sapply(clusters$g, function(z) length(intersect(z, id1)))
  sum(replace(lens, lens == 1, 0))
})
# [1] 2 0 5

df <- structure(list(D = c("A", "B", "C"), Q = c("asd", "reg", "rt"), id = structure(list(c(124, 532, 78), c(1, 3, 532), c(2, 3, 78, 124, 1)), class = "AsIs")), class = "data.frame", row.names = c(NA, -3L))

Upvotes: 2

ThomasIsCoding

Reputation: 102529

Probably you can try base R

transform(
    df,
    count = rowSums(outer(id, clusters$g, \(...) mapply(\(...) {
        l <- length(intersect(...))
        l * (l > 1)
    }, ...)))
)

or dplyr

df %>%
    left_join(df %>%
        unnest(id) %>%
        left_join(clusters %>%
            mutate(grp = row_number()) %>%
            unnest(g), by = join_by(id == g)) %>%
        summarise(count = {
            d <- table(grp)
            sum(d[d > 1])
        }, .by = c(D, Q)))

which gives

  D   Q           id count
1 A asd 124, 532, 78     2
2 B reg    1, 3, 532     0
3 C  rt 2, 3, 78....     5

Upvotes: 2

count values in one dataframe in another dataframe

Answers (3)

Related Questions