Reputation: 13
I am new to R but am excited to learn it and I thought this might be a good opportunity. I have two measurements of salinity (uS and mS.m_1.5). I have created 3 classes (1, 2, 3) for each measurement type (uSClass and mS.m_1.5Class) based on their values. For many of the observations, I only have 1 measurement type. I want to create a new class (SClass) based on these two classes.
Any observation of uSClass = 1 and mS.m_1.5Class = 1, should be SClass 1.
Any observation of uSClass = 1 and mS.m_1.5Class = NA, should be SClass 1.
Any observation of uSClass = NA and mS.m_1.5Class = 1, should be SClass 1. etc...
Any observation with conflicting classes (e.g. uSClass = 1 and mS.m_1.5Class = 2) should not be assigned a class (NA). This is my code:
std$SClass <- ifelse(std$uSClass == 1 & std$mS.m_1.5Class == 1, 1,
ifelse(std$uSClass == 1 & is.na(std$mS.m_1.5Class), 1,
ifelse(is.na(std$uSClass) & std$mS.m_1.5Class == 1, 1,
ifelse(std$uSClass == 2 & std$mS.m_1.5Class == 2, 2,
ifelse(std$uSClass == 2 & is.na(std$mS.m_1.5Class), 2,
ifelse(is.na(std$uSClass) & std$mS.m_1.5Class == 2, 2,
ifelse(std$uSClass == 3 & std$mS.m_1.5Class == 3, 3,
ifelse(std$uSClass == 3 & is.na(std$mS.m_1.5Class), 3,
ifelse(is.na(std$uSClass) & std$mS.m_1.5Class == 3, 3, NA)))))))))
It makes logical sense to me but it must not be correct. The only classifications that work are those where both uSClass and mS.m_1.5Class have values. If I run the entire code, most observations are assigned NA. I have tried a couple other methods incorporating | operators but those have not worked either. Your help is appreciated!
Upvotes: 1
Views: 9009
Reputation: 145775
The rowMeans approach works well in this case and will be very difficult to beat speed-wise. For a more general approach, most of what you're doing is finding the first non-missing value across a series of columns. This is commonly called a "coalesce", and it is built into the dplyr package (among others).
If you didn't have mismatches, then your operation could be simplified to this (using Pierre's nicely shared data):
mydata$r <- with(mydata, dplyr::coalesce(Var1, Var2))
mydata
# Var1 Var2 r
# 1 1 1 1
# 4 NA 1 1
# 6 2 2 2
# 8 NA 2 2
# 11 3 3 3
# 12 NA 3 3
# 13 1 NA 1
# 14 2 NA 2
# 15 3 NA 3
# 16 NA NA NA
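If dplyr is not available, the same idea can be sketched in base R. This is only a minimal illustration, not dplyr's implementation: `coalesce2` is a hypothetical helper name, and it assumes exactly two vectors of the same length.

```r
# Hypothetical helper (not part of base R or dplyr): take the value from `a`
# where it is present, otherwise fall back to `b`.
coalesce2 <- function(a, b) {
  ifelse(is.na(a), b, a)
}

var1 <- c(1, NA, 2, NA, 3, NA)
var2 <- c(1, 1, 2, 2, NA, NA)

res <- coalesce2(var1, var2)
print(res)
```

Only the last element stays NA, because both inputs are missing there.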
With mismatches, we need to check for those separately:
std$r <- with(std, ifelse(Var1 != Var2 & !is.na(Var1) & !is.na(Var2), NA,
                          dplyr::coalesce(Var1, Var2)))
std
# Var1 Var2 r
# 1 1 1 1
# 2 2 1 NA
# 3 3 1 NA
# 4 NA 1 1
# 5 1 2 NA
# 6 2 2 2
# 7 3 2 NA
# 8 NA 2 2
# 9 1 3 NA
# 10 2 3 NA
# 11 3 3 3
# 12 NA 3 3
# 13 1 NA 1
# 14 2 NA 2
# 15 3 NA 3
# 16 NA NA NA
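The extra !is.na() terms matter because a comparison with NA returns NA, and ifelse() passes an NA test straight through to its result. That is also why nested ifelse calls that test `x == 1 & y == 1` first tend to return NA for rows with one missing value. The short sketch below uses made-up vectors `u` and `m` purely for illustration.

```r
# A comparison involving NA is NA, not TRUE or FALSE.
cmp <- NA == 1

# `&` can only resolve an NA when the other side is FALSE.
and_true  <- NA & TRUE    # stays NA
and_false <- NA & FALSE   # FALSE

# Illustrative data: one complete pair and two pairs with a missing value.
u <- c(1, 1, NA)
m <- c(1, NA, 1)

# Naive test: rows with a missing value get an NA test, so ifelse() returns NA.
naive <- ifelse(u == 1 & m == 1, 1, 0)
print(naive)

# Guarded test: the is.na() checks turn the NA comparison into a definite FALSE.
mismatch <- u != m & !is.na(u) & !is.na(m)
print(mismatch)
```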
We can also go back to ifelse for a nice vectorized solution. I've wrapped it in a function as in @dayne's answer, but using the vectorized ifelse rather than if(){}else{} with an external call to mapply gives a big speed improvement (though rowMeans is still fastest):
getClass3 <- function(c1, c2) {
  ifelse(!is.na(c1) & !is.na(c2),
         ifelse(c1 == c2, c1, NA),    # both present: keep only if they agree
         ifelse(is.na(c1), c2, c1))   # otherwise take whichever is present
}
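As a quick sanity check, the function can be run on the small test vectors from @dayne's answer. The definition is repeated here only so the snippet runs on its own.

```r
getClass3 <- function(c1, c2) {
  ifelse(!is.na(c1) & !is.na(c2),
         ifelse(c1 == c2, c1, NA),    # both present: keep only if they agree
         ifelse(is.na(c1), c2, c1))   # otherwise take whichever is present
}

c1 <- c(1, 2, NA, 3, NA, NA, 2, NA, 1, 1, 2, 3)
c2 <- c(NA, NA, 1, 2, 1, 3, NA, NA, NA, 1, 2, 3)

res <- getClass3(c1, c2)
print(res)
# [1]  1  2  1 NA  1  3  2 NA  1  1  2  3
```

No mapply is needed, because every operation inside the function already works on whole vectors.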
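library(dplyr)           # for coalesce()
library(microbenchmark)

# std is the 1e5-row data frame from the Data section of the rowMeans answer;
# getClass2 is defined in @dayne's answer.
microbenchmark(
  plafortune = {
    r <- rowMeans(std, na.rm = TRUE)
    is.na(r) <- !r %in% 1:3 | std[, 1] != std[, 2]
  },
  dayne = {
    mapply(getClass2, c1 = std[, 1], c2 = std[, 2])
  },
  coal = {
    ifelse(std[, 1] != std[, 2] & !is.na(std[, 1]) & !is.na(std[, 2]),
           NA, coalesce(std[, 1], std[, 2]))
  },
  getClass_ifelse = {
    getClass3(std[, 1], std[, 2])
  }
)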
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# plafortune 10.09130 10.49593 18.95146 12.31516 14.46738 194.7095 100 a
# dayne 466.60288 499.47639 552.12454 529.53229 573.53311 823.2745 100 d
# coal 20.70184 24.10026 40.87038 26.22795 31.20252 217.3142 100 b
# getClass_ifelse 50.90161 56.41823 96.69930 64.78723 95.32416 262.2016 100 c
Running on the large data (1e5 rows), rowMeans is definitely fastest. Coalesce does pretty well, and the vectorized ifelse is still close to an order of magnitude faster than the one-element-at-a-time mapply version. Worth noting that if there were more columns involved, the rowMeans advantage would probably grow, and it would also be by far the easiest to code.
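For more than two columns, the `%in% 1:3` check on the row mean is no longer enough by itself (for example, 1, 2 and 3 average to 2). One possible way to generalize, sketched here with pmin and pmax instead of rowMeans, is to keep a row's class only when its smallest and largest non-missing values agree. `combine_classes` and the example data frame are hypothetical and not taken from any of the answers.

```r
# Hypothetical helper: returns the shared class of each row, or NA when the
# non-missing values disagree or when every value is missing.
combine_classes <- function(df) {
  cols <- unname(as.list(df))
  lo <- do.call(pmin, c(cols, list(na.rm = TRUE)))  # row-wise minimum
  hi <- do.call(pmax, c(cols, list(na.rm = TRUE)))  # row-wise maximum
  ifelse(lo == hi, lo, NA)
}

# Made-up example with three class columns.
df <- data.frame(a = c(1, NA, 2, NA, 1),
                 b = c(1, 2, 3, NA, NA),
                 c = c(NA, 2, 2, NA, 1))

res <- combine_classes(df)
print(res)
# [1]  1  2 NA NA  1
```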
Upvotes: 2
Reputation: 7784
I think this gives what you are asking for:
getClass <- function(c1, c2) {
  if (!is.na(c1) && !is.na(c2)) {
    return(NA)
  } else {
    return(ifelse(is.na(c1), c2, c1))
  }
}
c1 <- c(1, 2, NA, 3, NA, NA, 2, NA, 1)
c2 <- c(NA, NA, 1, 2, 1, 3, NA, NA, NA)
mapply(getClass, c1 = c1, c2 = c2)
# [1] 1 2 1 NA 1 3 2 NA 1
EDIT
If you want values that have the same class to return that class, just modify the first if statement:
getClass2 <- function(c1, c2) {
  if (!is.na(c1) && !is.na(c2) && c1 != c2) {
    return(NA)
  } else {
    return(ifelse(is.na(c1), c2, c1))
  }
}
c1 <- c(1, 2, NA, 3, NA, NA, 2, NA, 1, 1, 2, 3)
c2 <- c(NA, NA, 1, 2, 1, 3, NA, NA, NA, 1, 2, 3)
mapply(getClass2, c1 = c1, c2 = c2)
# [1] 1 2 1 NA 1 3 2 NA 1 1 2 3
Upvotes: 1
Reputation: 28441
You may be looking for rowMeans as a logical shortcut.
rowMeans(mydata, na.rm=TRUE)
Example
#Create example with all possible combinations
std <- expand.grid(c(1:3,NA), c(1:3,NA))
ind <- apply(std, 1, function(x) anyDuplicated(x) | any(is.na(x)))
mydata <- std[ind,]
mydata
# Var1 Var2
# 1 1 1
# 4 NA 1
# 6 2 2
# 8 NA 2
# 11 3 3
# 12 NA 3
# 13 1 NA
# 14 2 NA
# 15 3 NA
# 16 NA NA
The example is set up. Here are all the possible ways of combining 1 to 3 and NA without a mismatch. We use rowMeans to solve the problem:
mydata$SClass <- rowMeans(mydata, na.rm=TRUE)
mydata
# Var1 Var2 SClass
# 1 1 1 1
# 4 NA 1 1
# 6 2 2 2
# 8 NA 2 2
# 11 3 3 3
# 12 NA 3 3
# 13 1 NA 1
# 14 2 NA 2
# 15 3 NA 3
# 16 NA NA NaN
Edit
It makes no difference if there are also some mismatches. We can add:
r <- rowMeans(std, na.rm=TRUE)
is.na(r) <- !r %in% 1:3 | std[,1] != std[,2]
#Verification
cbind(std, r)
# Var1 Var2 r
# 1 1 1 1
# 2 2 1 NA
# 3 3 1 NA
# 4 NA 1 1
# 5 1 2 NA
# 6 2 2 2
# 7 3 2 NA
# 8 NA 2 2
# 9 1 3 NA
# 10 2 3 NA
# 11 3 3 3
# 12 NA 3 3
# 13 1 NA 1
# 14 2 NA 2
# 15 3 NA 3
# 16 NA NA NA
Verify above that all possible combinations are correct.
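The same check can also be done programmatically rather than by eye. The sketch below rebuilds the 16 combinations, applies the rowMeans approach, and compares it with a reference result written directly from the rules in the question. The names `v1`, `v2`, `expected` and `ok` are just for illustration.

```r
# All 16 combinations of 1, 2, 3 and NA in two columns.
std <- expand.grid(c(1:3, NA), c(1:3, NA))

# rowMeans approach: average the available values, then blank out
# non-integer means and rows where the two classes disagree.
r <- rowMeans(std, na.rm = TRUE)
is.na(r) <- !r %in% 1:3 | std[, 1] != std[, 2]

# Reference: take the other column if one is missing, keep the value if both
# agree, otherwise NA.
v1 <- std[, 1]
v2 <- std[, 2]
expected <- ifelse(is.na(v1), v2,
                   ifelse(is.na(v2) | v1 == v2, v1, NA))

ok <- isTRUE(all.equal(unname(r), as.numeric(expected)))
print(ok)
```

If the two approaches agree on every combination, this prints TRUE.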
Speed Test
Something for the doubters: roughly 5000% faster (median of about 10 ms versus about 551 ms).
Unit: milliseconds
expr min lq mean median uq max neval cld
plafortune 7.370385 9.246964 10.44307 10.10766 11.55795 18.72463 100 a
dayne 443.972804 506.965996 555.80049 550.91229 582.45713 831.18534 100 b
Data
# 1e5 rows of random classes (1 to 3) with missing values
std <- data.frame(x = sample(c(1:3, NA), 1e5, replace = TRUE),
                  y = sample(c(1:3, NA), 1e5, replace = TRUE))

# @dayne's original function, called once per row via mapply
getClass <- function(c1, c2) {
  if (!is.na(c1) && !is.na(c2)) {
    return(NA)
  } else {
    return(ifelse(is.na(c1), c2, c1))
  }
}

library(microbenchmark)
microbenchmark(
  plafortune = {
    r <- rowMeans(std, na.rm = TRUE)
    is.na(r) <- !r %in% 1:3 | std[, 1] != std[, 2]
  },
  dayne = {
    mapply(getClass, c1 = std[, 1], c2 = std[, 2])
  }
)
Upvotes: 2