Reputation: 13
Having gotten my data into the format:
pId fId
1 1 0
2 1 108
3 1 940
4 1 972
5 1 993
6 2 0
7 3 0
8 3 32
9 3 108
10 3 176
My goal is to try and (for a much longer set of data) determine which fIds each pId has in common with each other, and from that how many they have in common. My plan was to try and summarise into singular rows of pId where each fId is a list of fIds, and then loop a function like intersect() or of similar nature across that, for an ideal ouput of format:
pId1 pId2 together
1 1 2 1
2 1 3 2
3 1 4 N
4 2 3 1
etc....
EDIT: trying to work with the data in one of these ways
pId allfId allfIdSplit
1 1 0,901,940,972,993 c("0", "901", "940", "972", "993")
2 2 0 0
3 3 0,32,108,176 c("0", "32", "108", "176")
4 4 0,200,561,602,629,772,825,991 c("0", "200", "561", "602", "629", "772", "825", "991")
5 5 0 0
With code that I had so far, where df_a is startng point as shown above to give the output shown in the edit:
df_c <- df_a %>%
group_by(pId) %>%
arrange(pId) %>%
summarize(allFlights = paste(unique(flightId), collapse = ",")) %>%
mutate(allFlightsSplit = str_split(allFlights, ",")) %>%
print()
Upvotes: 1
Views: 156
Reputation: 11957
Here's one way to do it in the tidyverse. See comments in the code.
library(tidyverse)
library(magrittr)
df.counts <- combn(unique(df$pId), 2) %>% # unique combinations of pIDs
t %>% # transform to columns
as.data.frame() %>% # to data frame
set_colnames(c('pId1', 'pId2')) %>% # name the columns
left_join(df, by = c(pId1 = 'pId')) %>% # join the original data to pId1
left_join(df, by = c(pId2 = 'pId')) %>% #join original data to pId2
filter(fId.x == fId.y) %>% # get rid of duplicates
count(pId1, pId2) # count
pId1 pId2 n
<int> <int> <int>
1 1 2 1
2 1 3 2
3 2 3 1
Loops are usually not the best way to handle these types of problems in R, but since operations like combn
seem to be too expensive on your real data, this may be more performant.
pids <- unique(df$pId)
result <- list()
for (x in pids) {
for (y in setdiff(pids, x)) {
x.vals <- df$fId[df$pId == x]
y.vals <- df$fId[df$pId == y]
together <- length(intersect(x.vals, y.vals))
result[[length(result) + 1]] <- data.frame(pId1 = x, pId2 = y, together = together)
}
}
df.new <- do.call(rbind, result)
pId1 pId2 together
1 1 3 2
2 2 3 1
3 3 2 1
And here is a version that preallocates the size of the final data frame, which may be even more performant:
pids <- unique(df$pId)
result <- data.frame(pId1 = rep(NA, length(pids) * (length(pids) - 1) / 2), pId2 = NA, together = NA)
row.num <- 1
for (x in pids) {
for (y in setdiff(pids, x)) {
x.vals <- df$fId[df$pId == x]
y.vals <- df$fId[df$pId == y]
together <- length(intersect(x.vals, y.vals))
result[row.num, 'pId1'] <- x
result[row.num, 'pId2'] <- y
result[row.num, 'together'] <- together
row.num <- row.num + 1
}
}
Upvotes: 1