Reputation: 11
I have a data set that consists of Sample IDs and a corresponding outcome variable. However, some Sample IDs are duplicated. I would like to identify each duplicate Sample ID, recode it to a unique name, and recode its outcome variable as missing. I am aware that it would be easier to just remove the entire row, but I need to maintain the number of rows.
So I have a data set like this:
dt<- data.frame(ID=c("A", "B", "A", "C"), Outcome=c("1", "1", "1", "1"))
And I would like to recode it so it looks like this:
dt1<- data.frame(ID=c("A", "B", "A.1", "C"), Outcome=c("1", "1", "-9", "1"))
Thanks!
Upvotes: 1
Views: 384
Reputation: 887501
The dataset columns are of class 'factor'. I would use stringsAsFactors=FALSE in the data.frame call to create non-numeric columns with class 'character'. The reason is that if we want to change some values or replace some levels in a 'factor' column, the new value must already be one of the levels of that factor. To avoid that, I convert the existing 'factor' columns to 'character'. In the example, both columns are of class 'factor', so we loop through the columns of 'dt' with lapply and convert each one to 'character' with as.character.
dt[] <- lapply(dt, as.character)
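To illustrate the point about levels, here is a minimal sketch. The vectors f, f_bad and ch are hypothetical stand-ins for a factor 'ID' column and are not part of the original data; factor() is called explicitly so the sketch behaves the same regardless of the stringsAsFactors default.

```r
# Hypothetical stand-in for a factor 'ID' column
f <- factor(c("A", "B", "A", "C"))

# "A.1" is not one of the existing levels, so R warns and stores NA instead
f_bad <- f
f_bad[3] <- "A.1"   # warning: invalid factor level, NA generated

# After converting to character, the same assignment works as intended
ch <- as.character(f)
ch[3] <- "A.1"
```

This is why the conversion to 'character' comes before any renaming of IDs.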
I guess the OP wanted to have unique elements in the 'ID' column by renaming the duplicate IDs. One option is make.unique.
dt$ID <- make.unique(dt$ID)
After we convert the 'ID' column to unique IDs, we can check for a '.' in that column and replace the corresponding elements in the 'Outcome' column with -9.
dt$Outcome[grep('[.]', dt$ID)] <- -9
dt
#    ID Outcome
#1    A       1
#2    B       1
#3  A.1      -9
#4    C       1
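One caveat: the grep step assumes that no original ID already contains a '.'. A possible way around this, sketched below under that assumption, is to record the duplicates with duplicated before renaming. The data frame dt2, the vector dup and the ID "A.x" are made up for illustration.

```r
# Hypothetical data: "A.x" is an original ID that happens to contain a dot
dt2 <- data.frame(ID = c("A", "A.x", "A", "C"),
                  Outcome = c("1", "1", "1", "1"),
                  stringsAsFactors = FALSE)

# Flag repeated IDs before they are renamed
dup <- duplicated(dt2$ID)

dt2$ID <- make.unique(dt2$ID)
dt2$Outcome[dup] <- "-9"   # only the repeated row is recoded
```

Here a grep for '.' would also have matched the row with ID "A.x", whereas the logical index from duplicated touches only the repeated row.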
Or, as @A.Webb mentioned in the comments, we can use duplicated with ifelse to change the 'Outcome' column values.
transform(dt,
          ID = make.unique(as.character(ID)),             # change the ID column
          Outcome = ifelse(duplicated(ID), -9, Outcome))  # change Outcome
Upvotes: 3