Stephanie Ross

Reputation: 11

Identify duplicate Sample IDs, recode them as unique and set their outcome to missing using R

I have a data set that consists of Sample IDs and a corresponding outcome variable. However, there are some duplicate Sample IDs in my data set. What I would like to do is identify each duplicate Sample ID, recode it as a unique name, and then recode the corresponding outcome variable as missing. I am aware that it would be easier to just remove the entire row, but I need to maintain the number of rows.

So I have a data set like this:

dt<- data.frame(ID=c("A", "B", "A", "C"), Outcome=c("1", "1", "1", "1"))

And I would like to recode it so it looks like this:

dt1<- data.frame(ID=c("A", "B", "A.1", "C"), Outcome=c("1", "1", "-9", "1"))

Thanks!

Upvotes: 1

Views: 384

Answers (1)

akrun

Reputation: 887501

In the example, both columns are of class 'factor' (before R 4.0.0, data.frame() converted strings to factors by default). I would use stringsAsFactors=FALSE in the data.frame call so that non-numeric columns are created with class 'character'. The reason is that to change or replace values in a 'factor', the new value has to be one of the 'levels' of that 'factor' already. To avoid that, I convert the existing 'factor' columns to 'character': we loop through the columns of 'dt' with lapply and apply as.character to each one.

 dt[] <- lapply(dt, as.character)
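To illustrate why the conversion matters, here is a small sketch of what happens when a value that is not an existing level is assigned to a factor. The names f and ch are made up for this illustration and are not part of the original data.

```r
# A factor with levels "A", "B", "C"
f <- factor(c("A", "B", "A", "C"))

# "A.1" is not one of the levels, so R stores NA instead and would
# normally emit an "invalid factor level" warning (suppressed here)
suppressWarnings(f[3] <- "A.1")
is.na(f[3])

# The same assignment on a character vector works as expected
ch <- c("A", "B", "A", "C")
ch[3] <- "A.1"
ch
```

This is the reason the columns are converted to character before any recoding.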

I guess the OP wanted to have unique elements in the 'ID' column by replacing the duplicate IDs. One option is make.unique.

 dt$ID <- make.unique(dt$ID)

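After we convert the 'ID' column to unique IDs, we can check for a literal . in that column (make.unique appends .1, .2, and so on to repeated values) and replace the corresponding elements in the 'Outcome' column with -9. This assumes that none of the original IDs already contains a dot.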

 dt$Outcome[grep('[.]', dt$ID)] <- -9
 dt
 #   ID Outcome
 #1   A       1
 #2   B       1
 #3 A.1      -9
 #4   C       1
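If the original IDs might already contain a dot, the grep step would flag those rows as well. A minimal sketch that sidesteps this, assuming a hypothetical variant of the example data in which one ID is "B.2", records the duplicates with duplicated() before the IDs are renamed:

```r
# Hypothetical data: "B.2" is a legitimate ID that happens to contain a dot
dt <- data.frame(ID = c("A", "B.2", "A", "C"),
                 Outcome = c("1", "1", "1", "1"),
                 stringsAsFactors = FALSE)

# TRUE for the second and later occurrences of each ID
dup <- duplicated(dt$ID)

# Rename the repeats, then recode only the rows flagged above
dt$ID <- make.unique(dt$ID)
dt$Outcome[dup] <- "-9"   # Outcome stays a character column
dt
```

Only the third row is recoded here; a dot-based match would also have changed the "B.2" row.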

Or as @A.Webb mentioned in the comments, we can use duplicated with ifelse to change the 'Outcome' column values.

 transform(dt,
        ID=make.unique(as.character(ID)), #change the ID column
        Outcome=ifelse(duplicated(ID), -9, as.character(Outcome))) #change Outcome

Upvotes: 3
