Alex

Reputation: 13

How to summarise one set of id values in a dataframe grouped by another set of ids

Having gotten my data into the format:

    pId   fId   
1   1     0     
2   1     108   
3   1     940   
4   1     972   
5   1     993   
6   2     0     
7   3     0     
8   3     32    
9   3     108   
10  3     176

My goal (for a much longer set of data) is to determine which fIds each pair of pIds has in common, and from that how many they share. My plan was to summarise the data into one row per pId, where the fIds are collected into a list, and then loop a function like intersect() across those lists, for an ideal output of the format:

   pId1   pId2  together
1   1     2     1
2   1     3     2
3   1     4     N
4   2     3     1

etc....

EDIT: trying to work with the data in one of these ways

   pId  allfId                          allfIdSplit
1   1   0,901,940,972,993               c("0", "901", "940", "972", "993")
2   2   0                               0
3   3   0,32,108,176                    c("0", "32", "108", "176")
4   4   0,200,561,602,629,772,825,991   c("0", "200", "561", "602", "629", "772", "825", "991")
5   5   0                               0

With the code I have so far, where df_a is the starting point shown at the top, giving the output shown in the edit:

library(dplyr)
library(stringr)

df_c <- df_a %>%
  group_by(pId) %>%
  summarize(allfId = paste(unique(fId), collapse = ",")) %>% # one row per pId
  mutate(allfIdSplit = str_split(allfId, ",")) %>%           # split back into character vectors
  print()
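The planned intersect() step works directly on those split vectors; for example, taking the pId 1 and pId 3 vectors from the edit above (the variable names here are hypothetical):

```r
fids_p1 <- c("0", "901", "940", "972", "993") # allfIdSplit for pId 1
fids_p3 <- c("0", "32", "108", "176")         # allfIdSplit for pId 3

shared <- intersect(fids_p1, fids_p3) # elements common to both: "0"
length(shared)                        # 1 fId in common
```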

Upvotes: 1

Views: 156

Answers (1)

jdobres

Reputation: 11957

Here's one way to do it in the tidyverse. See comments in the code.

library(tidyverse)
library(magrittr)

df.counts <- combn(unique(df$pId), 2) %>% # unique combinations of pIds
  t %>% # transpose so each combination becomes a row
  as.data.frame() %>% # to data frame
  set_colnames(c('pId1', 'pId2')) %>%  # name the columns
  left_join(df, by = c(pId1 = 'pId')) %>% # join the original data to pId1
  left_join(df, by = c(pId2 = 'pId')) %>% # join the original data to pId2
  filter(fId.x == fId.y) %>% # keep only rows where the pair shares an fId
  count(pId1, pId2) # count shared fIds per pair

  pId1  pId2     n
  <int> <int> <int>
1     1     2     1
2     1     3     2
3     2     3     1
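For anyone reproducing this, the pipeline above assumes a df built from the question's sample data, something like (a sketch):

```r
# Sample data from the question
df <- data.frame(
  pId = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  fId = c(0, 108, 940, 972, 993, 0, 0, 32, 108, 176)
)
```

With this input, the pairs (1, 2), (1, 3) and (2, 3) share 1, 2 and 1 fIds respectively, matching the output above.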

An alternative using loops

Loops are usually not the best way to handle these types of problems in R, but since operations like combn seem to be too expensive on your real data, this may be more performant.

pids <- unique(df$pId)

result <- list()

for (x in pids) {
  for (y in pids[pids > x]) { # only pairs with y > x, so each pair appears once
    x.vals <- df$fId[df$pId == x]
    y.vals <- df$fId[df$pId == y]
    together <- length(intersect(x.vals, y.vals))
    result[[length(result) + 1]] <- data.frame(pId1 = x, pId2 = y, together = together)
  }
}

df.new <- do.call(rbind, result)

  pId1 pId2 together
1    1    2        1
2    1    3        2
3    2    3        1

And here is a version that preallocates the size of the final data frame, which may be even more performant:

pids <- unique(df$pId)
n.pairs <- length(pids) * (length(pids) - 1) / 2 # one row per unordered pair
result <- data.frame(pId1 = rep(NA, n.pairs), pId2 = NA, together = NA)
row.num <- 1
for (x in pids) {
  for (y in pids[pids > x]) { # again, visit each unordered pair exactly once
    x.vals <- df$fId[df$pId == x]
    y.vals <- df$fId[df$pId == y]
    together <- length(intersect(x.vals, y.vals))
    result[row.num, 'pId1'] <- x
    result[row.num, 'pId2'] <- y
    result[row.num, 'together'] <- together
    row.num <- row.num + 1
  }
}
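The same unordered-pair counting can also be collapsed into a single combn() pass, which is essentially the question's original plan of looping intersect() over combinations. A sketch, rebuilding the sample data so it runs standalone (the df.pairs name is mine):

```r
# Sample data from the question
df <- data.frame(
  pId = c(1, 1, 1, 1, 1, 2, 3, 3, 3, 3),
  fId = c(0, 108, 940, 972, 993, 0, 0, 32, 108, 176)
)

pids <- unique(df$pId)
pairs <- combn(pids, 2) # 2-row matrix; each column is one unordered pId pair

# For each pair, count the fIds the two pIds share
together <- apply(pairs, 2, function(p) {
  length(intersect(df$fId[df$pId == p[1]], df$fId[df$pId == p[2]]))
})

df.pairs <- data.frame(pId1 = pairs[1, ], pId2 = pairs[2, ], together = together)
df.pairs
#   pId1 pId2 together
# 1    1    2        1
# 2    1    3        2
# 3    2    3        1
```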

Upvotes: 1
