Shiva
Shiva

Reputation: 799

Looking for same values in two different columns in data.table

I have the following data

dt<-data.table(id=c(1,1,1,1,1),Claimed=c("i","j","i","j","k"),Detected=c("k","j","i","j","i"))
> dt
   id Claimed Detected
1:  1       i        k
2:  1       j        j
3:  1       i        i
4:  1       j        j
5:  1       k        i

Now, for each element in "Claimed" column, I want to check if it is same in "Detected" column and get the percentage of similarity. So, ideally my output would be the following for the above data table.

> dt
   id Claimed   Percentage
1:  1       i           50 
2:  1       j          100
3:  1       k            0

I tried to reshape the data and get all the "Detected" variables for each unique element in "Claimed" like below and count the occurrences of "Claimed" variable in new "Detected_modified" column and thus getting the percentage.

> dt
   id Claimed Detected_modified
1:  1       i               k,i
2:  1       j               j,j
3:  1       k                 i

But, I strongly believe there would be a simpler solution using data.table which I am not able to figure out in this situation. Can someone please have a try at it ?

EDIT :

The levels of factors in the Detected and Claimed columns are different.

Upvotes: 1

Views: 96

Answers (1)

jeremycg
jeremycg

Reputation: 24945

Using dplyr:

library(dplyr)

dt %>% group_by(id, Claimed) %>%
       summarise(percentage = 100 * sum(Claimed == Detected) / n( ))

Edit: if you cant compare due to differing factor levels:

dt %>% group_by(id, Claimed) %>%
       summarise(percentage = 100 * sum(as.character(Claimed) == as.character(Detected)) / n( ))

Upvotes: 2

Related Questions