PCK1992

Reputation: 213

Replacing values with 'NA' by ID in R

I have data that looks like this

ID    v1    v2
1     1     0
2     0     1
3     1     0
3     0     1
4     0     1

I want to replace all values with 'NA' if the ID occurs more than once in the dataframe. The final product should look like this

ID    v1    v2
1     1     0
2     0     1
3     NA    NA
3     NA    NA
4     0     1

I could do this by hand, but I want R to detect all the duplicated IDs (here, ID '3', which occurs twice) and replace the values in those rows with 'NA'.

Thanks for your help!

Upvotes: 0

Views: 337

Answers (3)

RHertel

Reputation: 23788

One more option:

df1[df1$ID %in% df1$ID[duplicated(df1$ID)], -1] <- NA
#> df1
#  ID v1 v2
#1  1  1  0
#2  2  0  1
#3  3 NA NA
#4  3 NA NA
#5  4  0  1
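
For readers who find the one-liner dense, here is a rough step-by-step sketch of the same idea. The intermediate names dup_ids and rows are made up for illustration and are not part of the original answer; the data frame is rebuilt inline so the snippet runs on its own.

```r
# Same example data as in the question
df1 <- data.frame(ID = c(1L, 2L, 3L, 3L, 4L),
                  v1 = c(1L, 0L, 1L, 0L, 0L),
                  v2 = c(0L, 1L, 0L, 1L, 1L))

# Step 1: IDs that show up again after their first occurrence
dup_ids <- df1$ID[duplicated(df1$ID)]

# Step 2: every row carrying one of those IDs, including the first occurrence
rows <- df1$ID %in% dup_ids

# Step 3: set all columns except the first (ID) to NA in those rows
df1[rows, -1] <- NA
```

The %in% step is what pulls the first occurrence back in, since duplicated() on its own only flags the second and later ones.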

data

df1 <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L, 
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1", 
"v2"), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 3

Rich Scriven

Reputation: 99331

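You could run duplicated() from both ends, combine the two results, and then replace. On its own, duplicated() only flags the second and later occurrences of an ID; adding fromLast = TRUE also catches the first one.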

idx <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)
df[idx, -1] <- NA

which gives

  ID v1 v2
1  1  1  0
2  2  0  1
3  3 NA NA
4  3 NA NA
5  4  0  1

This will also work if the duplicated IDs are not next to each other.
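
As a quick check of that claim, here is a small sketch on made-up data where the two rows with ID 3 are separated by other rows. The data frame df_scattered is hypothetical and not from the question.

```r
# Hypothetical data: ID 3 appears in rows 1 and 4, not next to each other
df_scattered <- data.frame(ID = c(3L, 1L, 2L, 3L, 4L),
                           v1 = c(1L, 1L, 0L, 0L, 0L),
                           v2 = c(0L, 0L, 1L, 1L, 1L))

# Flag every row whose ID occurs more than once, scanning from both ends
idx <- duplicated(df_scattered$ID) | duplicated(df_scattered$ID, fromLast = TRUE)

# Blank out all non-ID columns in the flagged rows
df_scattered[idx, -1] <- NA
```

Here idx flags rows 1 and 4 only, so the rows in between keep their values.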

Data:

df <- structure(list(ID = c(1L, 2L, 3L, 3L, 4L), v1 = c(1L, 0L, 1L, 
0L, 0L), v2 = c(0L, 1L, 0L, 1L, 1L)), .Names = c("ID", "v1", 
"v2"), class = "data.frame", row.names = c(NA, -5L))

Upvotes: 4

lmo

Reputation: 38500

Here is a base R method

# get list of repeated IDs
repeats <- rle(df$ID)$values[rle(df$ID)$lengths > 1]

# set the corresponding variables to NA
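df[-1] <- lapply(df[-1], function(col) {col[df$ID %in% repeats] <- NA; col})

In the first line, we use rle to extract repeated IDs. In the second, we use lapply to loop through the non-ID variables and, for each one, replace the values in rows whose ID repeats with NA. lapply is used rather than sapply so that each column keeps its own type instead of being simplified into a single matrix.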

Note that this assumes the data set is sorted by ID (more precisely, that repeated IDs sit in adjacent rows), because rle only counts consecutive runs. Sorting can be done with the order function: df <- df[order(df$ID), ].
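
To illustrate why the ordering matters, here is a minimal sketch on a made-up ID vector. The names ids, r_unsorted and r_sorted are hypothetical and only used for this illustration.

```r
# Hypothetical unsorted IDs: 3 repeats, but not in adjacent positions
ids <- c(3L, 1L, 2L, 3L, 4L)

# rle() only counts consecutive runs, so no run is longer than 1 here
r_unsorted <- rle(ids)
repeats_unsorted <- r_unsorted$values[r_unsorted$lengths > 1]

# After sorting, the two 3s form a single run of length 2
r_sorted <- rle(sort(ids))
repeats_sorted <- r_sorted$values[r_sorted$lengths > 1]
```

On the unsorted vector the repeat goes undetected (repeats_unsorted is empty), while the sorted vector yields 3 as the only repeated ID.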

If the dataset is very large, you might split the first line into two steps to avoid computing the rle twice:

dfRle <- rle(df$ID)
repeats <- dfRle$values[dfRle$lengths > 1]

data

df <- read.table(header = TRUE, text = "ID    v1    v2
1     1     0
2     0     1
3     1     0
3     0     1
4     0     1")

Upvotes: 0
