Haroon Lone

Reputation: 2949

Compute similarity percentage or correlation between more than 2 objects

Suppose I have four objects (a, b, c, d) and I ask five people to label each of them (category 1 or 2) according to their physical appearance or some other criterion. The labels provided by the five people are:

df <- data.frame(a = c(1,2,1,2,1), b = c(1,2,2,1,1), c = c(2,1,2,2,2), d = c(1,2,1,2,1))

In tabular format,

 ---------
  a b c d
 ---------
  1 1 2 1
  2 2 1 2
  1 2 2 1
  2 1 2 2
  1 1 2 1
 ---------

Now I want to calculate the percentage of times a group of objects was given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 out of 5 people, so the percentage is 3/5 (= 60%), whereas objects a and d were given the same label by all 5 people, so the percentage is 5/5 (= 100%).

I can calculate this statistic manually, but my original dataset has 50 such objects, 30 people, and 4 labels (1, 2, 3 and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R that can calculate them?

Note: A group can be of any size. In the first example the group consists of a, b and d, while in the second it consists of a and d.
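For concreteness, the manual check I do for the group (a, b, d) amounts to something like this (just my shorthand for the statistic, not a solution):

# fraction of rows in which a, b and d all carry the same label
mean(apply(df[c("a", "b", "d")], 1, function(x) length(unique(x)) == 1))
# [1] 0.6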

Upvotes: 8

Views: 748

Answers (6)

Peteris

Reputation: 3325

Clustering similarity metrics

It seems that you might want to calculate a substantially different (better?) metric than the one you propose, if your actual problem requires evaluating various ways of clustering the same data.

This paper, http://cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf, is a good overview of the problem, and the BCubed precision/recall metrics in particular are commonly used for similar problems in NLP (e.g. http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html).
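To illustrate the idea, a minimal BCubed sketch in base R could look like the following; `bcubed`, `pred` and `gold` are my own names, assuming two parallel vectors giving each item's predicted cluster and gold category:

bcubed <- function(pred, gold) {
  same_pred <- outer(pred, pred, "==")   # pairs sharing a predicted cluster
  same_gold <- outer(gold, gold, "==")   # pairs sharing a gold category
  # per-item precision: of the items clustered with i, how many share i's category
  prec <- rowSums(same_pred & same_gold) / rowSums(same_pred)
  # per-item recall: of the items in i's category, how many are clustered with i
  rec <- rowSums(same_pred & same_gold) / rowSums(same_gold)
  c(precision = mean(prec), recall = mean(rec))
}

bcubed(pred = c(1, 1, 2, 2), gold = c(1, 1, 1, 2))
# precision    recall 
# 0.7500000 0.6666667 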

Upvotes: 1

alistaire

Reputation: 43334

There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.

Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but convert into base if you like.

library(tidyverse)

map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%    # get combinations
    flatten() %>%    # eliminate nesting
    set_names(map_chr(., paste0, collapse = '')) %>%    # add useful names
    # subset df with combination, see if each row has only one unique value
    map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>% 
    map_dbl(~sum(.x) / length(.x))    # calculate TRUE proportion

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd 
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0 
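For reference, a base-R translation of the same idea might look like this (a sketch of one possible conversion, using only combn and apply):

combos <- unlist(lapply(2:length(df), function(k) combn(names(df), k, simplify = FALSE)),
                 recursive = FALSE)    # all groups of size 2..ncol(df)
names(combos) <- vapply(combos, paste0, character(1), collapse = '')
sapply(combos, function(cols) mean(apply(df[cols], 1, function(x) length(unique(x)) == 1)))

##   ab   ac   ad   bc   bd   cd  abc  abd  acd  bcd abcd 
##  0.6  0.2  1.0  0.2  0.6  0.2  0.0  0.6  0.2  0.0  0.0 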

Upvotes: 4

Sandipan Dey

Reputation: 23101

Try this:

find.unanimous.percentage <- function(df, at.a.time) {
  # all combinations of column names, taken at.a.time at once
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep = '')
  # for each combination, check row-wise whether all labels agree,
  # then average over rows and scale to a percentage
  cols$percent.unanimous <- 100 * colMeans(apply(cols, 1, function(x)
    apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}

find.unanimous.percentage(df, 2) # take 2 at a time

  O1 O2 percent.unanimous
1  a  b                60
2  a  c                20
3  a  d               100
4  b  c                20
5  b  d                60
6  c  d                20

find.unanimous.percentage(df, 3) # take 3 at a time

  O1 O2 O3 percent.unanimous
1  a  b  c                 0
2  a  b  d                60
3  a  c  d                20
4  b  c  d                 0

find.unanimous.percentage(df, 4)  

  O1 O2 O3 O4 percent.unanimous
1  a  b  c  d                 0

Upvotes: 2

Silence Dogood

Reputation: 3597

With base R functions you could do:

groupVec = c("a","b","d")

# transpose so that objects become rows and persons become columns
transDF = t(as.matrix(df))

subDF = transDF[rownames(transDF) %in% groupVec,]
subDF
#   [,1] [,2] [,3] [,4] [,5]
# a    1    2    1    2    1
# b    1    2    2    1    1
# d    1    2    1    2    1

# a column with exactly one unique value means every object in the group got the
# same label from that person; the proportion of such columns is the match pct
match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1))/ncol(subDF)
match_pct
# [1] 0.6

Wrapping it in a custom function:

fn_matchPercent = function(groupVec = c("a","d")) {
  transDF = t(as.matrix(df))
  subDF = transDF[rownames(transDF) %in% groupVec,]
  match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1))/ncol(subDF)
  outputDF = data.frame(groups = paste0(groupVec, collapse = ","), match_pct = match_pct)
  return(outputDF)
}

 fn_matchPercent(c("a","d"))
   # groups match_pct
 # 1    a,d         1
 fn_matchPercent(c("a","b","d"))
   # groups match_pct
 # 1  a,b,d       0.6

Upvotes: 2

thelatemail

Reputation: 93813

If you have numeric ratings, you can use diff to check whether, for each rater, every difference between consecutive objects in the group is 0:

f <- function(cols, data) {
  # t() puts objects in rows, so diff() takes each rater's differences between
  # consecutive objects; a rater whose differences are all zero gave every
  # object in the group the same label
  sum(colSums(diff(t(data[cols])) == 0) == (length(cols) - 1)) / nrow(data)
}

Results are as expected when applying the function to example groups:

f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1

Upvotes: 4

Eugene Brown

Reputation: 4362

Try this code. It works for your example when taking objects two at a time; because it compares pairs of columns directly, larger groups need a generalised check (see the sketch after the code).

df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))

# Find all unique combinations of the column names
group_pairs <- data.frame(t(combn(colnames(df), 2)))

# For each combination calculate the similarity
group_pairs$similarities <- apply(group_pairs, 1, function(x) {
  sum(df[x["X1"]] == df[x["X2"]])/nrow(df)
})
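To cover groups of more than two objects, one possible generalisation (my sketch, not part of the original answer; group_similarity is a hypothetical name) replaces the pairwise equality test with a check that all columns in the group agree:

# similarity for a group of any size: fraction of rows where all labels agree
group_similarity <- function(cols, data = df) {
  mean(apply(data[cols], 1, function(x) length(unique(x)) == 1))
}

group_similarity(c("a", "b", "d"))
# [1] 0.6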

Upvotes: 0
