Reputation: 799
I have the following data
dt<-data.table(id=c(1,1,1,1,1),Claimed=c("i","j","i","j","k"),Detected=c("k","j","i","j","i"))
> dt
id Claimed Detected
1: 1 i k
2: 1 j j
3: 1 i i
4: 1 j j
5: 1 k i
Now, for each element in "Claimed"
column, I want to check if it is same in "Detected"
column and get the percentage of similarity. So, ideally my output would be the following for the above data table.
> dt
id Claimed Percentage
1: 1 i 50
2: 1 j 100
3: 1 k 0
I tried to reshape the data and get all the "Detected"
variables for each unique element in "Claimed"
like below and count the occurrences of "Claimed"
variable in new "Detected_modified"
column and thus getting the percentage.
> dt
id Claimed Detected_modified
1: 1 i k,i
2: 1 j j,j
3: 1 k i
But, I strongly believe there would be a simpler solution using data.table which I am not able to figure out in this situation. Can someone please have a try at it ?
EDIT :
The levels of factors in the Detected
and Claimed
columns are different.
Upvotes: 1
Views: 96
Reputation: 24945
Using dplyr
:
library(dplyr)
dt %>% group_by(id, Claimed) %>%
summarise(percentage = 100 * sum(Claimed == Detected) / n( ))
Edit: if you cant compare due to differing factor levels:
dt %>% group_by(id, Claimed) %>%
summarise(percentage = 100 * sum(as.character(Claimed) == as.character(Detected)) / n( ))
Upvotes: 2