Reputation: 1
I have a dataset, where each row represents a project. Several people may work on every project (between 2 and 6).
The column names are id, person_1, person_2, person_3, person_4, person_5, person_6
If less than 6 people working on a project, only first n columns are filled, rest is NA.
I need to calculate how many times each pair of individuals worked together, what projects they worked together on, and for each pair of people, A and B what is the share of A's total projects were with B and vice-a-versa.
The issues I am stuck with is that: 1) if I have person_1 A and person_2 B and person_1 B and person_2 B it counts it as two different pairs while I need it to understand that it's the same pair. 2) Since most projects only involve 2 people, I tried the pairs with only 2 columns, but I need to incorporate columns person_3 - person_6.
I also have this database in a different format, where each row represents one person working on one project. In that case, I have all people in the same column, so my columns are project_id and name.
EDIT: example below
Data <- data.frame(
id = c(1:4),
person_1 = c("John", "Dan", "Peter", "James"),
person_2 = c("Dan", "John", "Kate", "Lisa"),
person_3 = c(NA, NA, "Kate", NA),
person_4 = c(NA, NA, "Peter", NA),
person_5 = c(NA, NA, NA, NA),
person_6 = c(NA, NA, NA, NA)
)
EDIT2: In my actual dataset I have 667 unique people and 2442 projects.
Upvotes: 0
Views: 267
Reputation: 5956
This is quite straight forward, using matrices, specifically an incidence and adjacency matrix. I will note, that this will work better with your 2 column dataset format.
1) Convert the table to an incidence matrix (1 project per row and 1 person per column), and tabulate the projects each person are assigned.
data_long <- reshape2::melt(Data, id.vars="proj_id", na.rm=TRUE)
proj_mat <- as.data.frame.table(table(data_long[-2]))
proj_mat <- reshape2::dcast(proj_mat, proj_id ~ value, value.var = "Freq")
# proj_id Dan James John Kate Lisa Peter
#1 P1 1 0 1 0 0 0
#2 P2 1 0 1 0 0 0
#3 P3 0 0 0 2 0 2
#4 P4 0 1 0 0 1 0
2) To know how many projects each pair of people worked on together you can simply calculate the jaccard index.
proxy::dist(t(proj_mat[-1]), method = "jaccard")
# Dan James John Kate Lisa
#James 1
#John 0 1
#Kate 1 1 1
#Lisa 1 0 1 1
#Peter 1 1 1 0 1
3) To get the proportion of shared projects we calculate the adjacency matrix and then divide by it's diagonal
proj_adj <- t(proj_mat[-1]) %*% as.matrix((proj_mat[-1]))
prop_together <- proj_adj/diag(proj_adj)
prop_together
# Dan James John Kate Lisa Peter
#Dan 1 0 1 0 0 0
#James 0 1 0 0 1 0
#John 1 0 1 0 0 0
#Kate 0 0 0 1 0 1
#Lisa 0 1 0 0 1 0
#Peter 0 0 0 1 0 1
Two clarifications:
1) Kate and Peter are assigned to the same project twice. Are these 4 different people? Do you expect them to have 2 projects in common or 1?
2) I don't follow your first issue. Do "A" and "B" refer to projects or their position in your initial table? Can you point out an example in your sample data or add an example to your sample data.
if I have person_1 A and person_2 B and person_1 B and person_2 B it
counts it as two different pairs while I need it to understand that
it's the same pair.
Upvotes: 1
Reputation: 1253
getCombs <- function(..., data = dat[,-1]){
names <- list(...)
pers1InProj <- vapply(names, function (name)
{rowSums(data == name, na.rm = T) > 0},
numeric(nrow(data)))
return( rowSums(pers1InProj) == length(names))
}
An example:
> Data
id person_1 person_2 person_3 person_4 person_5 person_6
1 1 John Dan <NA> <NA> NA NA
2 2 Dan John <NA> <NA> NA NA
3 3 Peter Kate Kate Peter NA NA
4 4 James Lisa <NA> <NA> NA NA
> which(getCombs("John", "Dan", data = Data))
[1] 1 2
> which(getCombs("Peter", "John", "Dan", data = Data))
integer(0)
which(getCombs("Peter", "Kate", data = Data))
[1] 3
Code for all pairs:
persons <- c("Peter", "John", "Dan", "James", "Kate", "Lisa")
tmp <- combn(persons, 2); #create all pairs of two elements from persons
tmp
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] "Peter" "Peter" "Peter" "Peter" "Peter" "John" "John" "John" "John"
[2,] "John" "Dan" "James" "Kate" "Lisa" "Dan" "James" "Kate" "Lisa"
[,10] [,11] [,12] [,13] [,14] [,15]
[1,] "Dan" "Dan" "Dan" "James" "James" "Kate"
[2,] "James" "Kate" "Lisa" "Kate" "Lisa" "Lisa"
# apply getCombs to every column fo tmp:
res <- apply(tmp, 2, function(pair) getCombs(pair[1], pair[2], data = Data))
# set colnames(res) for simplicity:
colnames(res) <- paste(tmp[1,], tmp[2,], sep="_")
# for example the sixth pair is c("John", "Dan"), they work together on
#project1 and project2 therefore res[1,6] == res[2,6]==TRUE:
> res
Peter_John Peter_Dan Peter_James Peter_Kate Peter_Lisa John_Dan John_James John_Kate
[1,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[3,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
John_Lisa Dan_James Dan_Kate Dan_Lisa James_Kate James_Lisa Kate_Lisa
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE
>
# number of times people worked together:
colSums(res)
Peter_John Peter_Dan Peter_James Peter_Kate Peter_Lisa John_Dan John_James
0 0 0 1 0 2 0
John_Kate John_Lisa Dan_James Dan_Kate Dan_Lisa James_Kate James_Lisa
0 0 0 0 0 0 1
Kate_Lisa
0
Solution if res doesnt fit into menory:
count <- integer(ncol(tmp))
names(count) <- colnames(res)
for(i in 1:ncol(tmp)) count[i] <- sum(getCombs(tmp[1,i], tmp[2,i], data = Data))
> count
Peter_John Peter_Dan Peter_James Peter_Kate Peter_Lisa John_Dan John_James
0 0 0 1 0 2 0
John_Kate John_Lisa Dan_James Dan_Kate Dan_Lisa James_Kate James_Lisa
0 0 0 0 0 0 1
Kate_Lisa
0
all.equal(count, colSums(res))
[1] TRUE
Upvotes: 0