Calculating intersection of lots of sets in R

Question

I've got a long list of authors and words, something like:

author1,word1
author1,word2
author1,word3
author2,word2
author3,word1

The actual list has hundreds of authors and thousands of words. It exists as a CSV file, which I have read into a data frame and de-duplicated:

    > typeof(x)
    [1] "list"
    > colnames(x)
    [1] "author"   "word"

The last bit of dput(head(x)) looks like

    ), class = "factor")), .Names = c("author", "word"), row.names = c(NA, 
    6L), class = "data.frame")

What I'm trying to do is calculate how similar the word lists are between authors, based on the intersection of two authors' word lists as a percentage of one author's total vocabulary. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)

In Python or Perl I would group all the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how I would do that in R. I have a feeling that "use apply" is going to be the answer. If it is, can you please explain it in small words for newbies like me?

bgoldst · Accepted Answer

Here's one way to do it using data.table:

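## 0: load the package that provides data.table(), setkey() and :=
library(data.table);

## 1: generate test data
## (sample() results depend on the R version, since the default sampling
## algorithm changed in R 3.6.0, so the rows shown below may differ)
set.seed(1L);
wordList <- paste0('word',1:5);
authorList <- paste0('author',1:5);
rs <- sample(1:5,length(authorList),replace=TRUE);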
aw <- data.table(
    author=factor(rep(authorList,rs)),
    word=factor(do.call(c,lapply(rs,function(r) sort(sample(wordList,r))))),
    key='author'
);
aw;
##      author  word
##  1: author1 word4
##  2: author1 word5
##  3: author2 word3
##  4: author2 word4
##  5: author3 word1
##  6: author3 word4
##  7: author3 word5
##  8: author4 word1
##  9: author4 word2
## 10: author4 word3
## 11: author4 word4
## 12: author4 word5
## 13: author5 word2
## 14: author5 word5

## 2: initialize intersection table with unique combinations of authors
ai <- aw[,setkey(setNames(nm=c('a1','a2'),as.data.table(t(combn(unique(author),2L)))))];

## 3: compute word intersection size for each combination of authors
ai[,int:=length(intersect(aw[a1,word],aw[a2,word])),key(ai)];
##          a1      a2 int
##  1: author1 author2   1
##  2: author1 author3   2
##  3: author1 author4   2
##  4: author1 author5   1
##  5: author2 author3   1
##  6: author2 author4   2
##  7: author2 author5   0
##  8: author3 author4   3
##  9: author3 author5   1
## 10: author4 author5   2

## 4: compute percentages
ai[,`:=`(p1=int/aw[a1,.N],p2=int/aw[a2,.N]),key(ai)];
##          a1      a2 int        p1        p2
##  1: author1 author2   1 0.5000000 0.5000000
##  2: author1 author3   2 1.0000000 0.6666667
##  3: author1 author4   2 1.0000000 0.4000000
##  4: author1 author5   1 0.5000000 0.5000000
##  5: author2 author3   1 0.5000000 0.3333333
##  6: author2 author4   2 1.0000000 0.4000000
##  7: author2 author5   0 0.0000000 0.0000000
##  8: author3 author4   3 1.0000000 0.6000000
##  9: author3 author5   1 0.3333333 0.5000000
## 10: author4 author5   2 0.4000000 1.0000000
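
Since the question asks specifically about the "use apply" route, here is a rough base R sketch of the same idea for comparison. It is not part of the data.table approach above, and it assumes the data frame is called x with columns author and word, as in the question. The names vocab and overlap are made up for illustration. split() does the "group all the words by author" step, and the two nested sapply() calls play the role of the nested loops.

```r
# Sample data mirroring the author/word layout from the question.
x <- data.frame(
  author = c("author1", "author1", "author1", "author2", "author3"),
  word   = c("word1", "word2", "word3", "word2", "word1"),
  stringsAsFactors = FALSE
)

# Group words by author: a named list with one character vector per author.
# as.character() keeps this working when the columns are factors.
vocab <- lapply(split(as.character(x$word), as.character(x$author)), unique)

# For every ordered pair of authors (a, b), compute the share of a's
# vocabulary that b also uses. The inner sapply() walks over a (rows),
# the outer sapply() walks over b (columns).
overlap <- sapply(vocab, function(b) {
  sapply(vocab, function(a) length(intersect(a, b)) / length(a))
})

# Rows are "whose vocabulary is the denominator", columns are "compared with".
# author1 has three words and shares one of them with author2, so this is 1/3.
overlap["author1", "author2"]
```

The result is a square matrix rather than a long table, and it is not symmetric: overlap["author2", "author1"] is 1 because author2's only word also appears in author1's list. With hundreds of authors this is still only tens of thousands of pairs, so the nested sapply() should be workable, though I have not timed it against the data.table version.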
