Apricot
Apricot

Reputation: 3021

Set value of a column to NA based on conditions in R

I have a data frame, a reproducible example is as follows:

structure(list(subscriberid = c(1177460837L, 1177460837L, 1177460837L, 
1146526049L, 1146526049L, 1146526049L), variable = c("3134", 
"4550", "4550", "5160", "2530", "2530"), value = c(1, 2, 2, 1, 
2, 2), gender = c(2, 2, 2, 1, 2, 2), cwe = c(NA, 50L, 50L, NA, 
30L, 30L), hw = c(NA, 48L, 48L, NA, 26L, 26L), resp = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), cna = c(3L, 1L, 1L, 3L, 1L, 1L)), .Names = c("subscriberid", 
"variable", "value", "gender", "cwe", "hw", "resp", "cna"), row.names = c(4L, 
5L, 6L, 9L, 10L, 11L), class = "data.frame")

The actual data frame looks like this:

   subscriberid variable value gender cwe hw resp cna
4    1177460837     3134     1      2  NA NA   NA   3
5    1177460837     4550     2      2  50 48   NA   1
6    1177460837     4550     2      2  50 48   NA   1
9    1146526049     5160     1      1  NA NA   NA   3
10   1146526049     2530     2      2  30 26   NA   1
11   1146526049     2530     2      2  30 26   NA   1

In the above df, row 5 and 6 are exactly the same. From row 5, I want to remove 48 and row 6 I want to remove 50. Essentially, I want to retain only one age in a row and set the other to NA. I tried using a for loop but that sets column values in the column that I refer in both the rows to NA.

for (i in 1:nrow(test)) {
  test$hw[i] <- ifelse(!is.na(test$cwe[i]) & !is.na(test$hw[i]), NA, test$hw[i])
}

I am trying to set an if condition to identify if both the rows are same, then I want to iteratively remove one of the values from the first row and remove the other from the second.

The desired output is as follows:

  subscriberid variable value gender cwe hw resp cna
4    1177460837     3134     1      2  NA NA   NA   3
5    1177460837     4550     2      2  50 NA   NA   1
6    1177460837     4550     2      2  NA 48   NA   1
9    1146526049     5160     1      1  NA NA   NA   3
10   1146526049     2530     2      2  30 NA   NA   1
11   1146526049     2530     2      2  NA 26   NA   1

Upvotes: 1

Views: 3006

Answers (4)

Shique
Shique

Reputation: 744

You can use a combination of which() and duplicated() to receive duplicated rows. Because you need to change values twice of the rows, you have to create a copy of the dataframe. Note that this will only work if the identical rows are always consecutive.

dfNA <- df
dfNA$hw[which(duplicated(df))-1] <- NA
dfNA$cwe[which(duplicated(df))] <- NA

dfNA
#   subscriberid variable value gender cwe hw resp cna
#4    1177460837     3134     1      2  NA NA   NA   3
#5    1177460837     4550     2      2  50 NA   NA   1
#6    1177460837     4550     2      2  NA 48   NA   1
#9    1146526049     5160     1      1  NA NA   NA   3
#10   1146526049     2530     2      2  30 NA   NA   1
#11   1146526049     2530     2      2  NA 26   NA   1

Upvotes: 3

digEmAll
digEmAll

Reputation: 57220

A possible solution :

# create a logical vector indicating if current row is identical to previous one
# N.B.: do.call("paste",c(DF,sep="\r")) is used internally by "duplicated.data.frame" function
rowStrings <- do.call("paste", c(DF, sep = "\r"))
currRowIsEqualToPrev <- rowStrings[-1] == rowStrings[-length(rowStrings)]

# set first row hw = NA and second identical row cwe = NA
DF[c(FALSE,currRowIsEqualToPrev),'hw'] <- NA
DF[c(currRowIsEqualToPrev,FALSE),'cwe'] <- NA

> DF
   subscriberid variable value gender cwe hw resp cna
4    1177460837     3134     1      2  NA NA   NA   3
5    1177460837     4550     2      2  NA 48   NA   1
6    1177460837     4550     2      2  50 NA   NA   1
9    1146526049     5160     1      1  NA NA   NA   3
10   1146526049     2530     2      2  NA 26   NA   1
11   1146526049     2530     2      2  30 NA   NA   1

Upvotes: 2

Marcus Campbell
Marcus Campbell

Reputation: 2796

I took a shot at it. This relies on using group_by from dplyr to find duplicate rows. This method assumes that rows can be reliably be identified as identical by using the subscriberid, variable, value, gender, resp, and cna columns alone.

Because it is operating within groups only, it will work even if a preceding non-identical row contains the same value for cwe (I did check this, but I would also confirm it for yourself if I were you).

library(dplyr)

ndf <- df %>%
       group_by(subscriberid, variable, value, gender, resp, cna) %>%
       mutate(cwe = na_if(cwe, lag(cwe)),
              hw = na_if(hw, lead(hw))) %>%
       ungroup()

Output:

# A tibble: 6 x 8
  subscriberid variable value gender   cwe    hw  resp   cna
         <int> <chr>    <dbl>  <dbl> <int> <int> <int> <int>
1   1177460837 3134        1.     2.    NA    NA    NA     3
2   1177460837 4550        2.     2.    50    NA    NA     1
3   1177460837 4550        2.     2.    NA    48    NA     1
4   1146526049 5160        1.     1.    NA    NA    NA     3
5   1146526049 2530        2.     2.    30    NA    NA     1
6   1146526049 2530        2.     2.    NA    26    NA     1

Upvotes: 1

zx8754
zx8754

Reputation: 56249

Using lead and lag from dplyr package:

library(dplyr)

df1 %>% 
  group_by(subscriberid, variable) %>% 
  mutate(cwe = if_else(lead(cwe) == cwe, cwe, NA_integer_),
         hw = if_else(lag(hw) == hw, hw, NA_integer_)) %>% 
  ungroup()

# # A tibble: 6 x 8
#   subscriberid variable value gender   cwe    hw resp    cna
#          <int>    <int> <int>  <int> <int> <int> <lgl> <int>
# 1   1177460837     3134     1      2    NA    NA NA        3
# 2   1177460837     4550     2      2    50    NA NA        1
# 3   1177460837     4550     2      2    NA    48 NA        1
# 4   1146526049     5160     1      1    NA    NA NA        3
# 5   1146526049     2530     2      2    30    NA NA        1
# 6   1146526049     2530     2      2    NA    26 NA        1

Upvotes: 2

Related Questions