Reputation: 33
Is there a fast way to count the number of times a value that appears in one of multiple vectors also appears in multiple other vectors? Here's an example:
library(data.table)
names<-c(rep('apple',4),rep('banana',3),rep('cantalope',2),'date')
set.seed(38291)
v1<-data.table(municipality=rep('A',6),village=rep('1',6),
last=sample(names,6,replace=TRUE),
middle=sample(names,6,replace=TRUE),id=c(1:6))
v2<-data.table(municipality=rep('A',4),village=rep('2',4),
last=sample(names,4,replace=TRUE),
middle=sample(names,4,replace=TRUE),id=c(7:10))
v1
# municipality village last middle id
# 1: A 1 banana cantalope 1
# 2: A 1 cantalope banana 2
# 3: A 1 cantalope cantalope 3
# 4: A 1 apple apple 4
# 5: A 1 banana apple 5
# 6: A 1 apple apple 6
v2
# municipality village last middle id
# 1: A 2 date cantalope 7
# 2: A 2 apple date 8
# 3: A 2 cantalope banana 9
# 4: A 2 apple cantalope 10
DT = rbind(v1, v2)
I want to count the number of family ties that cross between individuals in village 1 and village 2. Cross-village family ties are defined by whether an individual's last OR middle name ('last' or 'middle') matches someone's last or middle name in another village. In this example, the individual with id=1 who lives in village 1 has three family members in village 2 (those with ids 7, 9, and 10) because he shares at least one name with them. I then want to create a dyadic dataset of villages where the ties between villages are defined by the number of family ties that cross between those villages. So, in this example the final dataset would look like:
dyads<-data.table(v1='1',v2='2',ties=3+3+3+2+3+2)
dyads
v1 v2 ties
1: 1 2 16
Is there an efficient way to calculate this 'ties' number? I've written an inefficient for loop to do this, but I have a massive dataset (~ 50 million individuals in 40000 villages).
Upvotes: 3
Views: 329
Reputation: 34703
An update inspired by Frank's answer:
meltDT =
#use unique to eliminate last+middle duplication
unique(melt(DT, measure.vars = c('last', 'middle'),
id.vars = c('village', 'id'), value.name = 'name'),
by = c('village', 'id', 'name'))
#framework of output -- one row for each pair of villages
out.dt = with(DT, CJ(village, village, unique = TRUE))[V2 > V1]
setkey(meltDT, village)
setindex(meltDT, name)
#set indices to facilitate merges on names
out.dt[ , {
ties :=
#unique here eliminates matching on both last & middle
uniqueN(meltDT[.(.BY$V1)][meltDT[.(.BY$V2)], on = 'name',
allow.cartesian = TRUE, nomatch = 0L],
by = c('id', 'i.id'))
}, by = .(V1, V2)]
out.dt
# V1 V2 ties
# 1: 1 2 16
Upvotes: 6
Reputation: 66819
This extends to 3+ villages, but will be quite slow:
DT = rbind(v1, v2)
matches = melt(DT, id="id", measure.vars=c("middle","last"))[,
CJ(id1 = id, id2 = id)[id1 < id2]
, by=value]
matches[DT, on=.(id1 = id), v1 := i.village ]
matches[DT, on=.(id2 = id), v2 := i.village ]
unique(matches[, !"value"])[v2 != v1, .N, by=.(v1, v2)]
# v1 v2 N
# 1: 1 2 16
So it finds individuals who match (even if they're in the same village) and the OP's desired result is just a summary stat computed with this set of matches.
Upvotes: 3