Reputation: 3021
I have a data frame, a reproducible example is as follows:
structure(list(subscriberid = c(1177460837L, 1177460837L, 1177460837L,
1146526049L, 1146526049L, 1146526049L), variable = c("3134",
"4550", "4550", "5160", "2530", "2530"), value = c(1, 2, 2, 1,
2, 2), gender = c(2, 2, 2, 1, 2, 2), cwe = c(NA, 50L, 50L, NA,
30L, 30L), hw = c(NA, 48L, 48L, NA, 26L, 26L), resp = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), cna = c(3L, 1L, 1L, 3L, 1L, 1L)), .Names = c("subscriberid",
"variable", "value", "gender", "cwe", "hw", "resp", "cna"), row.names = c(4L,
5L, 6L, 9L, 10L, 11L), class = "data.frame")
The actual data frame looks like this:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 48 NA 1
6 1177460837 4550 2 2 50 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 26 NA 1
11 1146526049 2530 2 2 30 26 NA 1
In the above df, row 5 and 6 are exactly the same. From row 5, I want to remove 48 and row 6 I want to remove 50. Essentially, I want to retain only one age in a row and set the other to NA. I tried using a for loop but that sets column values in the column that I refer in both the rows to NA.
for (i in 1:nrow(test)) {
test$hw[i] <- ifelse(!is.na(test$cwe[i]) & !is.na(test$hw[i]), NA, test$hw[i])
}
I am trying to set an if condition to identify if both the rows are same, then I want to iteratively remove one of the values from the first row and remove the other from the second.
The desired output is as follows:
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 50 NA NA 1
6 1177460837 4550 2 2 NA 48 NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 30 NA NA 1
11 1146526049 2530 2 2 NA 26 NA 1
Upvotes: 1
Views: 3006
Reputation: 744
You can use a combination of which()
and duplicated()
to receive duplicated rows.
Because you need to change values twice of the rows, you have to create a copy of the dataframe. Note that this will only work if the identical rows are always consecutive.
dfNA <- df
dfNA$hw[which(duplicated(df))-1] <- NA
dfNA$cwe[which(duplicated(df))] <- NA
dfNA
# subscriberid variable value gender cwe hw resp cna
#4 1177460837 3134 1 2 NA NA NA 3
#5 1177460837 4550 2 2 50 NA NA 1
#6 1177460837 4550 2 2 NA 48 NA 1
#9 1146526049 5160 1 1 NA NA NA 3
#10 1146526049 2530 2 2 30 NA NA 1
#11 1146526049 2530 2 2 NA 26 NA 1
Upvotes: 3
Reputation: 57220
A possible solution :
# create a logical vector indicating if current row is identical to previous one
# N.B.: do.call("paste",c(DF,sep="\r")) is used internally by "duplicated.data.frame" function
rowStrings <- do.call("paste", c(DF, sep = "\r"))
currRowIsEqualToPrev <- rowStrings[-1] == rowStrings[-length(rowStrings)]
# set first row hw = NA and second identical row cwe = NA
DF[c(FALSE,currRowIsEqualToPrev),'hw'] <- NA
DF[c(currRowIsEqualToPrev,FALSE),'cwe'] <- NA
> DF
subscriberid variable value gender cwe hw resp cna
4 1177460837 3134 1 2 NA NA NA 3
5 1177460837 4550 2 2 NA 48 NA 1
6 1177460837 4550 2 2 50 NA NA 1
9 1146526049 5160 1 1 NA NA NA 3
10 1146526049 2530 2 2 NA 26 NA 1
11 1146526049 2530 2 2 30 NA NA 1
Upvotes: 2
Reputation: 2796
I took a shot at it. This relies on using group_by
from dplyr
to find duplicate rows. This method assumes that rows can be reliably be identified as identical by using the subscriberid
, variable
, value
, gender
, resp
, and cna
columns alone.
Because it is operating within groups only, it will work even if a preceding non-identical row contains the same value for cwe
(I did check this, but I would also confirm it for yourself if I were you).
library(dplyr)
ndf <- df %>%
group_by(subscriberid, variable, value, gender, resp, cna) %>%
mutate(cwe = na_if(cwe, lag(cwe)),
hw = na_if(hw, lead(hw))) %>%
ungroup()
Output:
# A tibble: 6 x 8
subscriberid variable value gender cwe hw resp cna
<int> <chr> <dbl> <dbl> <int> <int> <int> <int>
1 1177460837 3134 1. 2. NA NA NA 3
2 1177460837 4550 2. 2. 50 NA NA 1
3 1177460837 4550 2. 2. NA 48 NA 1
4 1146526049 5160 1. 1. NA NA NA 3
5 1146526049 2530 2. 2. 30 NA NA 1
6 1146526049 2530 2. 2. NA 26 NA 1
Upvotes: 1
Reputation: 56249
Using lead and lag from dplyr package:
library(dplyr)
df1 %>%
group_by(subscriberid, variable) %>%
mutate(cwe = if_else(lead(cwe) == cwe, cwe, NA_integer_),
hw = if_else(lag(hw) == hw, hw, NA_integer_)) %>%
ungroup()
# # A tibble: 6 x 8
# subscriberid variable value gender cwe hw resp cna
# <int> <int> <int> <int> <int> <int> <lgl> <int>
# 1 1177460837 3134 1 2 NA NA NA 3
# 2 1177460837 4550 2 2 50 NA NA 1
# 3 1177460837 4550 2 2 NA 48 NA 1
# 4 1146526049 5160 1 1 NA NA NA 3
# 5 1146526049 2530 2 2 30 NA NA 1
# 6 1146526049 2530 2 2 NA 26 NA 1
Upvotes: 2