Dotan Haim
Dotan Haim

Reputation: 33

R (data.table): Fast counts of value matches in multiple column

Is there a fast way to count the number of times a value that appears in one of multiple vectors also appears in multiple other vectors? Here's an example:

library(data.table)
names<-c(rep('apple',4),rep('banana',3),rep('cantalope',2),'date')
set.seed(38291)
v1<-data.table(municipality=rep('A',6),village=rep('1',6),
               last=sample(names,6,replace=TRUE),
               middle=sample(names,6,replace=TRUE),id=c(1:6))
v2<-data.table(municipality=rep('A',4),village=rep('2',4),
               last=sample(names,4,replace=TRUE),
               middle=sample(names,4,replace=TRUE),id=c(7:10))
v1
#    municipality village      last    middle id
# 1:            A       1    banana cantalope  1
# 2:            A       1 cantalope    banana  2
# 3:            A       1 cantalope cantalope  3
# 4:            A       1     apple     apple  4
# 5:            A       1    banana     apple  5
# 6:            A       1     apple     apple  6
v2
#    municipality village      last    middle id
# 1:            A       2      date cantalope  7
# 2:            A       2     apple      date  8
# 3:            A       2 cantalope    banana  9
# 4:            A       2     apple cantalope 10
DT = rbind(v1, v2)

I want to count the number of family ties that cross between individuals in village 1 and village 2. Cross-village family ties are defined by whether an individual's last OR middle name ('last' or 'middle') matches someone's last or middle name in another village. In this example, the individual with id=1 who lives in village 1 has three family members in village 2 (those with ids 7, 9, and 10) because he shares at least one name with them. I then want to create a dyadic dataset of villages where the ties between villages are defined by the number of family ties that cross between those villages. So, in this example the final dataset would look like:

dyads<-data.table(v1='1',v2='2',ties=3+3+3+2+3+2)
dyads
   v1 v2 ties
1:  1  2   16

Is there an efficient way to calculate this 'ties' number? I've written an inefficient for loop to do this, but I have a massive dataset (~ 50 million individuals in 40000 villages).

Upvotes: 3

Views: 329

Answers (2)

MichaelChirico
MichaelChirico

Reputation: 34703

An update inspired by Frank's answer:

meltDT = 
  #use unique to eliminate last+middle duplication
  unique(melt(DT, measure.vars = c('last', 'middle'), 
              id.vars = c('village', 'id'), value.name = 'name'),
         by = c('village', 'id', 'name'))

#framework of output -- one row for each pair of villages
out.dt = with(DT, CJ(village, village, unique = TRUE))[V2 > V1]

setkey(meltDT, village)
setindex(meltDT, name)
#set indices to facilitate merges on names
out.dt[ , {
  ties := 
    #unique here eliminates matching on both last & middle
    uniqueN(meltDT[.(.BY$V1)][meltDT[.(.BY$V2)], on = 'name', 
                              allow.cartesian = TRUE, nomatch = 0L],
            by = c('id', 'i.id'))
}, by = .(V1, V2)]
out.dt
#    V1 V2 ties
# 1:  1  2   16

Upvotes: 6

Frank
Frank

Reputation: 66819

This extends to 3+ villages, but will be quite slow:

DT = rbind(v1, v2)

matches = melt(DT, id="id", measure.vars=c("middle","last"))[, 
  CJ(id1 = id, id2 = id)[id1 < id2]
, by=value]

matches[DT, on=.(id1 = id), v1 := i.village ]
matches[DT, on=.(id2 = id), v2 := i.village ]

unique(matches[, !"value"])[v2 != v1, .N, by=.(v1, v2)]
#    v1 v2  N
# 1:  1  2 16

So it finds individuals who match (even if they're in the same village) and the OP's desired result is just a summary stat computed with this set of matches.

Upvotes: 3

Related Questions