trillian
trillian

Reputation: 49

Function to compare rows among themselves of a specific column with the same ID

I have a big laboratory database, some IDs have multiple results, I also created another key variable with initials+age+sex variables for other matching purposes with the medical records of the hospital. However I noticed that sometimes different initials have the same hospital ID. I want to write a function to detect this incoherence.

So the example of the data base:

df=data.frame(ID=c("5606","5606","5728","5824","5824","5824","5824"),
                     key2=c("TN35M","TN35M","JJ26M","CD47F","CD47F","DG44M","DG44M"),
              date_sample=c("12/03/2012","12/03/2012","19/04/2012","21/05/2012","21/05/2012","19/10/2012","19/10/2012"), service=c("ORTHO","ORTHO","BLOC","VISC","VISC","BLOC","BLOC"), germe=c("Acinetobacter sp","Burkholderia pseudomallei","Stenotrophomonas maltophilia","Staphylococcus haemolyticus"," Enterobacter cloacae","Escherichia  coli","Pseudomonas aeruginosa"))


ID      key2    date_sample service germe

5606    TN35M   12/03/2012  ORTHO   Acinetobacter sp
5606    TN35M   12/03/2012  ORTHO   Burkholderia pseudomallei
5728    JJ26M   19/04/2012  BLOC    Stenotrophomonas maltophilia
5824    CD47F   21/05/2012  VISC    Staphylococcus haemolyticus
5824    CD47F   21/05/2012  VISC    Enterobacter cloacae
5824    DG44M   19/10/2012  BLOC    Escherichia coli
5824    DG44M   19/10/2012  BLOC    Pseudomonas aeruginosa

Each ID supposed to have one unique key2 variable. How can I compare rows of "key2" variable for the same "ID" variable and have an output variable to detect me all the lines with incoherence, in order to be sure that each ID given to one unique patient but not shared by more than 1 patient?

like:


ID       key2   date_sample service germe                        incoherence

5606    TN35M   12/03/2012  ORTHO   Acinetobacter sp                N
5606    TN35M   12/03/2012  ORTHO   Burkholderia pseudomallei       N
5728    JJ26M   19/04/2012  BLOC    Stenotrophomonas maltophilia    N
5824    CD47F   21/05/2012  VISC    Staphylococcus haemolyticus     Y
5824    CD47F   21/05/2012  VISC    Enterobacter cloacae            Y
5824    DG44M   19/10/2012  BLOC    Escherichia coli                Y
5824    DG44M   19/10/2012  BLOC    Pseudomonas aeruginosa          Y

Upvotes: 0

Views: 90

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388947

Using dplyr

library(dplyr)
df %>%
 group_by(ID) %>%
 mutate(incoherence = c("N", "Y")[(n_distinct(key2) > 1) +1])

#   ID    key2 incoherence
#  <fct> <fct> <chr>      
#1 5606  TN35M N          
#2 5606  TN35M N          
#3 5728  JJ26M N          
#4 5824  CD47F Y          
#5 5824  CD47F Y          
#6 5824  DG44M Y          
#7 5824  DG44M Y       

and data.table

library(data.table)
setDT(df)[, incoherence := c("N", "Y")[(uniqueN(key2) > 1) +1], by = ID]

Upvotes: 1

Sotos
Sotos

Reputation: 51582

You can count the unique values of each group. If more than 1 then Y (or in this case TRUE), i.e.

!with(df, ave(key2, ID, FUN = function(i) length(unique(i)))) == 1
#[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

NOTE: Make sure your variables are characters, not factors

Upvotes: 1

Related Questions