Reputation: 115
I've got a long list of authors and words, something like:
author1,word1
author1,word2
author1,word3
author2,word2
author3,word1
The actual list has hundreds of authors and thousands of words. It exists as a CSV file, which I have read into a data frame and de-duplicated, like:
> typeof(x)
[1] "list"
> colnames(x)
[1] "author" "word"
The last bit of dput(head(x)) looks like:
), class = "factor")), .Names = c("author", "word"), row.names = c(NA,
6L), class = "data.frame")
What I'm trying to do is calculate how similar the word lists are between authors, based on the intersection of two authors' word lists as a percentage of one author's total vocabulary. (I'm sure there are proper terms for what I'm doing, but I don't quite know what they are.)
In Python or Perl I would group all the words by author and use nested loops to compare everyone with everyone else, but I'm wondering how I would do that in R. I have a feeling that "use apply" is going to be the answer. If it is, can you please explain it in small words for newbies like me?
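To pin the measure down, here is a minimal sketch for just two authors. The vectors a and b are made-up word lists, not taken from the real data, and the names pct_of_a and pct_of_b are only for illustration. Note that the measure is not symmetric: the same shared words are a different share of each author's vocabulary.

```r
# Hypothetical word lists for two authors (illustration only).
a <- c("word1", "word2", "word3")
b <- c("word2")

# Words that both authors use.
shared <- intersect(a, b)

# The shared words as a percentage of each author's own vocabulary.
pct_of_a <- 100 * length(shared) / length(a)
pct_of_b <- 100 * length(shared) / length(b)
```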
Upvotes: 0
Views: 222
Reputation: 35314
Here's one way to do it using data.table:
library(data.table);
## 1: generate test data
set.seed(1L);
wordList <- paste0('word',1:5);
authorList <- paste0('author',1:5);
rs <- sample(1:5,length(authorList),replace=TRUE);
aw <- data.table(
author=factor(rep(authorList,rs)),
word=factor(do.call(c,lapply(rs,function(r) sort(sample(wordList,r))))),
key='author'
);
aw;
##      author  word
##  1: author1 word4
##  2: author1 word5
##  3: author2 word3
##  4: author2 word4
##  5: author3 word1
##  6: author3 word4
##  7: author3 word5
##  8: author4 word1
##  9: author4 word2
## 10: author4 word3
## 11: author4 word4
## 12: author4 word5
## 13: author5 word2
## 14: author5 word5
## 2: initialize intersection table with unique combinations of authors
ai <- aw[,setkey(setNames(nm=c('a1','a2'),as.data.table(t(combn(unique(author),2L)))))];
## 3: compute word intersection size for each combination of authors
## by=key(ai) evaluates the expression once per (a1,a2) pair
ai[,int:=length(intersect(aw[a1,word],aw[a2,word])),by=key(ai)];
ai;
##          a1      a2 int
##  1: author1 author2   1
##  2: author1 author3   2
##  3: author1 author4   2
##  4: author1 author5   1
##  5: author2 author3   1
##  6: author2 author4   2
##  7: author2 author5   0
##  8: author3 author4   3
##  9: author3 author5   1
## 10: author4 author5   2
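If the aw[a1,word] idiom is unfamiliar, the sketch below shows the same kind of keyed lookup on its own. The table demo is a hypothetical three-row stand-in, not the test data above; the only assumption is that the table is keyed on author, as aw is.

```r
library(data.table)

# Hypothetical three-row table, keyed on author like aw in step 1.
demo <- data.table(
  author = c("author1", "author1", "author2"),
  word   = c("word4", "word5", "word3"),
  key    = "author"
)

# A character value in the first slot is matched against the key column;
# naming a column in the second slot returns that column as a plain vector.
demo["author1", word]

# .N in the second slot counts the matching rows, i.e. the vocabulary size.
demo["author1", .N]
```

The first lookup gives the words of one author, which is what intersect() needs; the second gives the denominator used for the percentages in step 4.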
## 4: compute percentages
## p1 = share of a1's words also used by a2; p2 = share of a2's words also used by a1
ai[,`:=`(p1=int/aw[a1,.N],p2=int/aw[a2,.N]),by=key(ai)];
ai;
##          a1      a2 int        p1        p2
##  1: author1 author2   1 0.5000000 0.5000000
##  2: author1 author3   2 1.0000000 0.6666667
##  3: author1 author4   2 1.0000000 0.4000000
##  4: author1 author5   1 0.5000000 0.5000000
##  5: author2 author3   1 0.5000000 0.3333333
##  6: author2 author4   2 1.0000000 0.4000000
##  7: author2 author5   0 0.0000000 0.0000000
##  8: author3 author4   3 1.0000000 0.6000000
##  9: author3 author5   1 0.3333333 0.5000000
## 10: author4 author5   2 0.4000000 1.0000000
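Since the question asks about the apply family, here is a rough base R sketch of the same idea without data.table. It is an alternative under stated assumptions, not a translation of the code above: the data frame x is a hypothetical stand-in built from the five sample rows in the question, and words, overlap and sim are names invented for this example.

```r
# Hypothetical stand-in for the de-duplicated data frame x from the question.
x <- data.frame(
  author = c("author1", "author1", "author1", "author2", "author3"),
  word   = c("word1", "word2", "word3", "word2", "word1"),
  stringsAsFactors = FALSE
)

# Group the words by author: a named list with one character vector per author.
words <- split(x$word, x$author)

# Share of author a's vocabulary that also appears in author b's vocabulary.
overlap <- function(a, b) {
  length(intersect(words[[a]], words[[b]])) / length(words[[a]])
}

# Compare every author with every other author.
# The two nested sapply calls play the role of the nested loops;
# rows of the result are a, columns are b.
authors <- names(words)
sim <- sapply(authors, function(b) sapply(authors, function(a) overlap(a, b)))
sim
```

Here sim["author1","author2"] is the fraction of author1's words that author2 also uses. With hundreds of authors this builds the full square matrix, including the diagonal, so it does roughly twice the work of the combination-based approach above.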
Upvotes: 2