Reputation: 29
Let's say I have a data frame like the one below and I need to identify each row where one or more missing values (NA) are followed by at least one valid value (any numerical). Can you help me?
a <- c(1, 'S06.4', 6.7, 7.0, 6.5, 7.0, 7.2, NA, NA, 6.6,6.7)
b <- c(2 ,'S06.2' ,5.0, NA, 4.9, 7.8, 9.3, 8.0, 7.8, 8.0,NA)
c <- c(3, 'S06.5', 7.0, 5.5, NA, NA, 7.2, 8.0, 7.6, NA,6.7)
d <- c(4, 'S06.5', 7.0, 7.0, 7.0, 6.9, 6.8, 9.0, 6.0, 6.6,6.7)
e <- c(5, 'S06.1', 6.7, NA, NA, NA, NA, NA, NA, NA,NA)
df <- data.frame(rbind(a,b,c,d,e))
colnames(df) <- c('id','dx','dia01','dia02','dia03','dia04','dia05','dia06','dia07','dia08','dia09')
Upvotes: 2
Views: 1366
Reputation: 83275
With:
df[rowSums(is.na(df[,3:10]) * !is.na(df[,4:11])) > 0,]
you get:
id dx dia01 dia02 dia03 dia04 dia05 dia06 dia07 dia08 dia09 a 1 S06.4 6.7 7 6.5 7 7.2 <NA> <NA> 6.6 6.7 b 2 S06.2 5 <NA> 4.9 7.8 9.3 8 7.8 8 <NA> c 3 S06.5 7 5.5 <NA> <NA> 7.2 8 7.6 <NA> 6.7
What this does:
is.na(df[,3:10])
check which of the values in the dia01
to dia08
columns is NA
and returns a logical matrix.!is.na(df[,4:11])
does the same for the next values in each row of df[,3:10]
and also returns a logical matrixrowSums
you check whether the conditions is met at least once in each row.In response to your comment: if you want to make sure the NA
is followed by a numeric value, you could alter to above solution to:
# first convert the 'dia*''-columns to numeric
df[-c(1,2)] <- lapply(df[-c(1,2)], function(x) as.numeric(as.character(x)))
# then do the same because values that can't converted to numeric will give NA
df[rowSums(is.na(df[,3:10]) * !is.na(df[,4:11])) > 0,]
Or without convert to numeric first:
df[rowSums(is.na(df[,3:10]) * !is.na(sapply(df[4:11], function(x) as.numeric(as.character(x))))) > 0,]
NOTE:
With the method you used to construct your example data, you will end up with all factor columns. Of which I suppose you don't want that.
A possibly correctly formatted example dataset would be:
df <- structure(list(id = c("1", "2", "3", "4", "5"),
dx = c("S06.4", "S06.2", "S06.5", "S06.5", "S06.1"),
dia01 = c(6.7, 5, 7, 7, 6.7),
dia02 = c(7, NA, 5.5, 7, NA),
dia03 = c(6.5, 4.9, NA, 7, NA),
dia04 = c(7, 7.8, NA, 6.9, NA),
dia05 = c(7.2, 9.3, 7.2, 6.8, NA),
dia06 = c(NA, 8, 8, 9, NA),
dia07 = c(NA, 7.8, 7.6, 6, NA),
dia08 = c(6.6, 8, NA, 6.6, NA),
dia09 = c(6.7, NA, 6.7, 6.7, NA)),
.Names = c("id", "dx", "dia01", "dia02", "dia03", "dia04", "dia05", "dia06", "dia07", "dia08", "dia09"),
row.names = c("a", "b", "c", "d", "e"),
class = "data.frame")
The proposed method works on that as well.
As noted by @Frank in the comments, it is mostly better to store your data in long format. With:
library(data.table)
setDT(df)[, 3:11 := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 3:11][]
melt(df, id = 1:2)[, if(any(is.na(value) & !is.na(shift(value, type = 'lead')))) .SD, by = .(id, dx)]
you get:
id dx variable value
1: 1 S06.4 dia01 6.7
2: 1 S06.4 dia02 7.0
3: 1 S06.4 dia03 6.5
4: 1 S06.4 dia04 7.0
5: 1 S06.4 dia05 7.2
6: 1 S06.4 dia06 NA
7: 1 S06.4 dia07 NA
8: 1 S06.4 dia08 6.6
9: 1 S06.4 dia09 6.7
10: 2 S06.2 dia01 5.0
11: 2 S06.2 dia02 NA
12: 2 S06.2 dia03 4.9
13: 2 S06.2 dia04 7.8
14: 2 S06.2 dia05 9.3
15: 2 S06.2 dia06 8.0
16: 2 S06.2 dia07 7.8
17: 2 S06.2 dia08 8.0
18: 2 S06.2 dia09 NA
19: 3 S06.5 dia01 7.0
20: 3 S06.5 dia02 5.5
21: 3 S06.5 dia03 NA
22: 3 S06.5 dia04 NA
23: 3 S06.5 dia05 7.2
24: 3 S06.5 dia06 8.0
25: 3 S06.5 dia07 7.6
26: 3 S06.5 dia08 NA
27: 3 S06.5 dia09 6.7
Another alternative is:
setDT(df)[, 3:11 := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 3:11][]
df[unique(melt(df, id = 1:2)[, .I[is.na(value) & !is.na(shift(value, type = 'lead'))], by = .(id, dx)], by = 'id')[,'id'], on = 'id']
The result of this approach is however still in wide format as presented in the first part of this answer.
Upvotes: 8
Reputation: 51592
Another idea is to use apply
with margin 1
to go over each row and compare the minimum index of NA with the maximum index of non-NA, i.e.
#convert to numeric first to capture only valid numbers (as in @Jaap's answer)
df[-c(1,2)] <- lapply(df[-c(1,2)], function(x) as.numeric(as.character(x)))
apply(d1, 1, function(i) min(which(is.na(i))) < max(which(!is.na(i))))
# a b c d e
# TRUE TRUE TRUE FALSE FALSE
#or
df[apply(d1, 1, function(i) min(which(is.na(i))) < max(which(!is.na(i)))),]
which gives,
id dx dia01 dia02 dia03 dia04 dia05 dia06 dia07 dia08 dia09 a 1 S06.4 6.7 7 6.5 7 7.2 <NA> <NA> 6.6 6.7 b 2 S06.2 5 <NA> 4.9 7.8 9.3 8 7.8 8 <NA> c 3 S06.5 7 5.5 <NA> <NA> 7.2 8 7.6 <NA> 6.7
Upvotes: 3
Reputation: 12569
Here is a solution with rle()
:
(I used the data definition from the answer of Jaap)
df <- structure(list(id = c("1", "2", "3", "4", "5"),
dx = c("S06.4", "S06.2", "S06.5", "S06.5", "S06.1"),
dia01 = c(6.7, 5, 7, 7, 6.7),
dia02 = c(7, NA, 5.5, 7, NA),
dia03 = c(6.5, 4.9, NA, 7, NA),
dia04 = c(7, 7.8, NA, 6.9, NA),
dia05 = c(7.2, 9.3, 7.2, 6.8, NA),
dia06 = c(NA, 8, 8, 9, NA),
dia07 = c(NA, 7.8, 7.6, 6, NA),
dia08 = c(6.6, 8, NA, 6.6, NA),
dia09 = c(6.7, NA, 6.7, 6.7, NA)),
.Names = c("id", "dx", "dia01", "dia02", "dia03", "dia04", "dia05", "dia06", "dia07", "dia08", "dia09"),
row.names = c("a", "b", "c", "d", "e"),
class = "data.frame")
R <- apply(is.na(df[-(1:2)]), 1, rle)
id.row <- function(r) {
first.na <- which(r$value)[1]
if (is.na(first.na)) return(FALSE)
if (first.na==length(r$value)) return(FALSE)
return(TRUE)
}
sapply(R, id.row)
#> sapply(R, id.row)
# a b c d e
# TRUE TRUE TRUE FALSE FALSE
Upvotes: 2