How to find a number of combinations (not permutations) of values in two columns in R?

I have a dataset, where each row represents a project. Several people may work on every project (between 2 and 6).

The column names are id, person_1, person_2, person_3, person_4, person_5, person_6

If less than 6 people working on a project, only first n columns are filled, rest is NA.

I need to calculate how many times each pair of individuals worked together, what projects they worked together on, and for each pair of people, A and B what is the share of A's total projects were with B and vice-a-versa.

The issues I am stuck with is that: 1) if I have person_1 A and person_2 B and person_1 B and person_2 B it counts it as two different pairs while I need it to understand that it's the same pair. 2) Since most projects only involve 2 people, I tried the pairs with only 2 columns, but I need to incorporate columns person_3 - person_6.

I also have this database in a different format, where each row represents one person working on one project. In that case, I have all people in the same column, so my columns are project_id and name.

EDIT: example below

Data <- data.frame(
    id = c(1:4),
    person_1 = c("John", "Dan", "Peter", "James"),
person_2 = c("Dan", "John", "Kate", "Lisa"),
person_3 = c(NA, NA, "Kate", NA),
person_4 = c(NA, NA, "Peter", NA),
person_5 = c(NA, NA, NA, NA),
person_6 = c(NA, NA, NA, NA)
)

EDIT2: In my actual dataset I have 667 unique people and 2442 projects.

Upvotes: 0

Views: 267

Answers (2)

emilliman5
emilliman5

Reputation: 5956

This is quite straight forward, using matrices, specifically an incidence and adjacency matrix. I will note, that this will work better with your 2 column dataset format.

1) Convert the table to an incidence matrix (1 project per row and 1 person per column), and tabulate the projects each person are assigned.

data_long <- reshape2::melt(Data, id.vars="proj_id", na.rm=TRUE)
proj_mat <- as.data.frame.table(table(data_long[-2]))
proj_mat <- reshape2::dcast(proj_mat, proj_id ~ value, value.var = "Freq")

#  proj_id Dan James John Kate Lisa Peter
#1      P1   1     0    1    0    0     0
#2      P2   1     0    1    0    0     0
#3      P3   0     0    0    2    0     2
#4      P4   0     1    0    0    1     0

2) To know how many projects each pair of people worked on together you can simply calculate the jaccard index.

proxy::dist(t(proj_mat[-1]), method = "jaccard")

#      Dan James John Kate Lisa
#James   1                     
#John    0     1               
#Kate    1     1    1          
#Lisa    1     0    1    1     
#Peter   1     1    1    0    1

3) To get the proportion of shared projects we calculate the adjacency matrix and then divide by it's diagonal

proj_adj <- t(proj_mat[-1]) %*% as.matrix((proj_mat[-1]))
prop_together <- proj_adj/diag(proj_adj)
prop_together
#      Dan James John Kate Lisa Peter
#Dan     1     0    1    0    0     0
#James   0     1    0    0    1     0
#John    1     0    1    0    0     0
#Kate    0     0    0    1    0     1
#Lisa    0     1    0    0    1     0
#Peter   0     0    0    1    0     1

Two clarifications:
1) Kate and Peter are assigned to the same project twice. Are these 4 different people? Do you expect them to have 2 projects in common or 1?

2) I don't follow your first issue. Do "A" and "B" refer to projects or their position in your initial table? Can you point out an example in your sample data or add an example to your sample data.

if I have person_1 A and person_2 B and person_1 B and person_2 B it
counts it as two different pairs while I need it to understand that
it's the same pair.

Upvotes: 1

Grada Gukovic
Grada Gukovic

Reputation: 1253

getCombs <- function(..., data = dat[,-1]){
    names <- list(...)
    pers1InProj <- vapply(names, function (name)  
                                 {rowSums(data == name, na.rm = T) > 0}, 
                                  numeric(nrow(data)))


    return( rowSums(pers1InProj) == length(names)) 
}


An example:

> Data
  id person_1 person_2 person_3 person_4 person_5 person_6
1  1     John      Dan     <NA>     <NA>       NA       NA
2  2      Dan     John     <NA>     <NA>       NA       NA
3  3    Peter     Kate     Kate    Peter       NA       NA
4  4    James     Lisa     <NA>     <NA>       NA       NA
> which(getCombs("John", "Dan", data = Data))
 [1] 1 2
> which(getCombs("Peter", "John", "Dan", data = Data))
 integer(0)

which(getCombs("Peter", "Kate", data = Data))
 [1] 3

Code for all pairs:

persons <- c("Peter", "John", "Dan", "James", "Kate", "Lisa")

tmp <- combn(persons, 2); #create all pairs of two elements from persons
tmp
     [,1]    [,2]    [,3]    [,4]    [,5]    [,6]   [,7]    [,8]   [,9]  
[1,] "Peter" "Peter" "Peter" "Peter" "Peter" "John" "John"  "John" "John"
[2,] "John"  "Dan"   "James" "Kate"  "Lisa"  "Dan"  "James" "Kate" "Lisa"
     [,10]   [,11]  [,12]  [,13]   [,14]   [,15] 
[1,] "Dan"   "Dan"  "Dan"  "James" "James" "Kate"
[2,] "James" "Kate" "Lisa" "Kate"  "Lisa"  "Lisa"

# apply getCombs to every column fo tmp:
res <- apply(tmp, 2, function(pair) getCombs(pair[1], pair[2], data = Data))

# set colnames(res) for simplicity:
colnames(res) <- paste(tmp[1,], tmp[2,], sep="_")

# for example the sixth pair is c("John", "Dan"), they work together on 
 #project1 and project2 therefore res[1,6] == res[2,6]==TRUE:

> res
     Peter_John Peter_Dan Peter_James Peter_Kate Peter_Lisa John_Dan John_James John_Kate
[1,]      FALSE     FALSE       FALSE      FALSE      FALSE     TRUE      FALSE     FALSE
[2,]      FALSE     FALSE       FALSE      FALSE      FALSE     TRUE      FALSE     FALSE
[3,]      FALSE     FALSE       FALSE       TRUE      FALSE    FALSE      FALSE     FALSE
[4,]      FALSE     FALSE       FALSE      FALSE      FALSE    FALSE      FALSE     FALSE
     John_Lisa Dan_James Dan_Kate Dan_Lisa James_Kate James_Lisa Kate_Lisa
[1,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[2,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[3,]     FALSE     FALSE    FALSE    FALSE      FALSE      FALSE     FALSE
[4,]     FALSE     FALSE    FALSE    FALSE      FALSE       TRUE     FALSE
> 
# number of times people worked together: 
 colSums(res)
 Peter_John   Peter_Dan Peter_James  Peter_Kate  Peter_Lisa    John_Dan  John_James 
          0           0           0           1           0           2           0 
  John_Kate   John_Lisa   Dan_James    Dan_Kate    Dan_Lisa  James_Kate  James_Lisa 
          0           0           0           0           0           0           1 
  Kate_Lisa 
          0 

Solution if res doesnt fit into menory:

count <- integer(ncol(tmp))
names(count) <- colnames(res)
for(i in 1:ncol(tmp)) count[i] <- sum(getCombs(tmp[1,i], tmp[2,i], data = Data))

> count
 Peter_John   Peter_Dan Peter_James  Peter_Kate  Peter_Lisa    John_Dan  John_James 
          0           0           0           1           0           2           0 
  John_Kate   John_Lisa   Dan_James    Dan_Kate    Dan_Lisa  James_Kate  James_Lisa 
          0           0           0           0           0           0           1 
  Kate_Lisa 
          0 

 all.equal(count, colSums(res))
[1] TRUE

Upvotes: 0

Related Questions