Reputation: 495
I have a list of individuals, charities, and years. I am trying to find out how many times individual i overlaps with individual j in a given charity and year. I would like to make a square matrix for every year and have any given cell tell me the number of overlaps.
Example of Data:
Individual Year Charity
1 2003 A
2 2003 A
2 2003 B
2 2005 A
... ... ...
17 2003 A
17 2003 B
Wanted Result 2003 (for every year):
Individual  Individual_1  Individual_2  ...  Individual_17
1           .             1             ...  1
2           1             .             ...  2
...         ...           ...           ...  ...
17          1             2             ...  .
I have heard that R is best for network data, but right now I am using Stata. I created a variable for each individual, and I am running an if condition that looks in the [_n+x] cell for the individual in the given column and places a one. I was then going to aggregate these data. This seems to be working, but it is very time intensive and I am sure there could be an error.
qui forval j = 1/1750 {
gen individual_`j'= 0
}
qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2002 & charity == "A"
}
qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2003 & charity == "A"
}
qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2004 & charity == "A"
}
qui forval j = 1/1750 {
replace individual_`j' = 1 if individual[_n+`j'] == 1 & year == 2005 & charity == "A"
}
I would then sum over each charity. The data are too numerous for this brute-force approach to work; hopefully there is an easier way.
I am open to doing this outside of Stata.
Upvotes: 2
Views: 1204
Reputation: 2414
As an alternative, you might want to consider benchmarking the following. First, tabulate all triplets (entries will be 1 or 0 depending on whether an individual contributed to the charity in the year):
tbl <- table(dat$Individual, dat$Charity, dat$Year)
Now we want to loop through each Year (the third dimension of tbl) and, for each pair of rows (individuals), calculate the number of shared 1's. This can be achieved as follows:
res <- apply(tbl, 3, function(x) x %*% t(x))
dim(res) <- c(dim(tbl)[1], dim(tbl)[1], dim(tbl)[3])
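A quick check of the two steps above on the question's sample rows may help. This is only a sketch: the data frame dat is typed in by hand, and re-attaching the dimnames is an addition of mine (assigning dim drops them), not part of the answer itself.

```r
# Sample rows from the question (hand-typed; the "..." rows are omitted)
dat <- data.frame(
  Individual = c(1, 2, 2, 2, 17, 17),
  Year       = c(2003, 2003, 2003, 2005, 2003, 2003),
  Charity    = c("A", "A", "B", "A", "A", "B")
)

# Individuals x charities x years, with 1 where a triplet occurs
tbl <- table(dat$Individual, dat$Charity, dat$Year)

# One individuals-by-individuals matrix per year
res <- apply(tbl, 3, function(x) x %*% t(x))
dim(res) <- c(dim(tbl)[1], dim(tbl)[1], dim(tbl)[3])

# Assigning dim() drops the labels, so put them back (an addition, see above)
dimnames(res) <- list(dimnames(tbl)[[1]], dimnames(tbl)[[1]], dimnames(tbl)[[3]])

# Overlap matrix for 2003; the diagonal holds each individual's own charity count
res[, , "2003"]
```

The off-diagonal cells are the overlaps asked for. If the diagonal should be blank as in the wanted result, it can be set to NA for each year's slice.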
Upvotes: 0
Reputation: 3525
I recently did something similar. First add a column combining year and charity, then convert the data frame into a list of charities (tagged with year) per individual. I called your example data x.
# Combine year and charity into one event key, e.g. "2003_A"
x$info <- paste(x$Year, x$Charity, sep = "_")
# One list element per individual, holding that individual's events
All_Groups.list <- vector(length(unique(x$Individual)), mode = "list")
names(All_Groups.list) <- as.character(unique(x$Individual))
for (i in seq_along(All_Groups.list)) {
  All_Groups.list[[i]] <- as.character(x[x$Individual == names(All_Groups.list)[i], "info"])
}
Self.Cor.table <- sapply(All_Groups.list, function(x) {
sapply(All_Groups.list,function(y){
length(x[x %in% y])
})
})
The output is a square table (not a true correlation matrix, despite the name) whose entries count the overlap in attended events:
> Self.Cor.table
   1 2 17
1  1 1  1
2  1 3  2
17 1 2  2
This differs from your desired output by giving the number of events attended by each individual on the diagonal instead of a ., which I think is important because each individual attends a different number of events.
If you want it per year, subset the data frame by year and repeat for each subset.
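One way the per-year step might look, sketched on the question's sample rows. The name overlap_table is made up for this illustration, and the helper uses intersect() with outer() rather than the nested sapply above so that a year with a single individual still comes back as a matrix. Within one year the counts should match the sapply version as long as no row is duplicated.

```r
# Sample rows from the question (hand-typed; the "..." rows are omitted)
x <- data.frame(
  Individual = c(1, 2, 2, 2, 17, 17),
  Year       = c(2003, 2003, 2003, 2005, 2003, 2003),
  Charity    = c("A", "A", "B", "A", "A", "B")
)

# overlap_table is a hypothetical helper: pairwise shared charities in one subset
overlap_table <- function(d) {
  groups <- split(as.character(d$Charity), d$Individual)
  ids <- names(groups)
  shared <- Vectorize(function(i, j) length(intersect(groups[[i]], groups[[j]])))
  m <- outer(ids, ids, shared)
  dimnames(m) <- list(ids, ids)
  m
}

# One table per year, named by year
by_year <- lapply(split(x, x$Year), overlap_table)
by_year[["2003"]]
```

Each element of by_year is a square matrix for that year, so by_year[["2005"]] holds only the individuals who appear in 2005.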
Upvotes: 3