Reputation: 49
I have a big laboratory database, some IDs have multiple results, I also created another key variable with initials+age+sex variables for other matching purposes with the medical records of the hospital. However I noticed that sometimes different initials have the same hospital ID. I want to write a function to detect this incoherence.
So the example of the data base:
df=data.frame(ID=c("5606","5606","5728","5824","5824","5824","5824"),
key2=c("TN35M","TN35M","JJ26M","CD47F","CD47F","DG44M","DG44M"),
date_sample=c("12/03/2012","12/03/2012","19/04/2012","21/05/2012","21/05/2012","19/10/2012","19/10/2012"), service=c("ORTHO","ORTHO","BLOC","VISC","VISC","BLOC","BLOC"), germe=c("Acinetobacter sp","Burkholderia pseudomallei","Stenotrophomonas maltophilia","Staphylococcus haemolyticus"," Enterobacter cloacae","Escherichia coli","Pseudomonas aeruginosa"))
ID key2 date_sample service germe
5606 TN35M 12/03/2012 ORTHO Acinetobacter sp
5606 TN35M 12/03/2012 ORTHO Burkholderia pseudomallei
5728 JJ26M 19/04/2012 BLOC Stenotrophomonas maltophilia
5824 CD47F 21/05/2012 VISC Staphylococcus haemolyticus
5824 CD47F 21/05/2012 VISC Enterobacter cloacae
5824 DG44M 19/10/2012 BLOC Escherichia coli
5824 DG44M 19/10/2012 BLOC Pseudomonas aeruginosa
Each ID supposed to have one unique key2 variable. How can I compare rows of "key2" variable for the same "ID" variable and have an output variable to detect me all the lines with incoherence, in order to be sure that each ID given to one unique patient but not shared by more than 1 patient?
like:
ID key2 date_sample service germe incoherence
5606 TN35M 12/03/2012 ORTHO Acinetobacter sp N
5606 TN35M 12/03/2012 ORTHO Burkholderia pseudomallei N
5728 JJ26M 19/04/2012 BLOC Stenotrophomonas maltophilia N
5824 CD47F 21/05/2012 VISC Staphylococcus haemolyticus Y
5824 CD47F 21/05/2012 VISC Enterobacter cloacae Y
5824 DG44M 19/10/2012 BLOC Escherichia coli Y
5824 DG44M 19/10/2012 BLOC Pseudomonas aeruginosa Y
Upvotes: 0
Views: 90
Reputation: 388947
Using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(incoherence = c("N", "Y")[(n_distinct(key2) > 1) +1])
# ID key2 incoherence
# <fct> <fct> <chr>
#1 5606 TN35M N
#2 5606 TN35M N
#3 5728 JJ26M N
#4 5824 CD47F Y
#5 5824 CD47F Y
#6 5824 DG44M Y
#7 5824 DG44M Y
and data.table
library(data.table)
setDT(df)[, incoherence := c("N", "Y")[(uniqueN(key2) > 1) +1], by = ID]
Upvotes: 1
Reputation: 51582
You can count the unique values of each group. If more than 1 then Y
(or in this case TRUE
), i.e.
!with(df, ave(key2, ID, FUN = function(i) length(unique(i)))) == 1
#[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
NOTE: Make sure your variables are characters, not factors
Upvotes: 1