Reputation: 29

Find rows with a sequence of consecutive column values

Let's say I have a data frame like the one below and I need to identify each row where one or more missing values (NA) are followed by at least one valid value (any numerical). Can you help me?

a <- c(1, 'S06.4', 6.7, 7.0, 6.5, 7.0, 7.2, NA, NA, 6.6,6.7) 
b <- c(2 ,'S06.2' ,5.0, NA, 4.9, 7.8, 9.3, 8.0, 7.8, 8.0,NA)
c <- c(3, 'S06.5', 7.0, 5.5, NA, NA, 7.2, 8.0, 7.6, NA,6.7) 
d <- c(4, 'S06.5', 7.0, 7.0, 7.0, 6.9, 6.8, 9.0, 6.0, 6.6,6.7) 
e <- c(5, 'S06.1', 6.7, NA, NA, NA, NA, NA, NA, NA,NA) 

df <- data.frame(rbind(a,b,c,d,e))
colnames(df) <- c('id','dx','dia01','dia02','dia03','dia04','dia05','dia06','dia07','dia08','dia09')

Upvotes: 2

Answers (3)

Jaap

Reputation: 83275

With:

df[rowSums(is.na(df[,3:10]) * !is.na(df[,4:11])) > 0,]

you get:

  id    dx dia01 dia02 dia03 dia04 dia05 dia06 dia07 dia08 dia09
a  1 S06.4   6.7     7   6.5     7   7.2  <NA>  <NA>   6.6   6.7
b  2 S06.2     5  <NA>   4.9   7.8   9.3     8   7.8     8  <NA>
c  3 S06.5     7   5.5  <NA>  <NA>   7.2     8   7.6  <NA>   6.7

What this does:

is.na(df[,3:10]) check which of the values in the dia01 to dia08 columns is NA and returns a logical matrix.
!is.na(df[,4:11]) does the same for the next values in each row of df[,3:10] and also returns a logical matrix
Multiplying these two matrices gives a logical matrix for the required condition.
With rowSums you check whether the conditions is met at least once in each row.

In response to your comment: if you want to make sure the NA is followed by a numeric value, you could alter to above solution to:

# first convert the 'dia*''-columns to numeric
df[-c(1,2)] <- lapply(df[-c(1,2)], function(x) as.numeric(as.character(x)))
# then do the same because values that can't converted to numeric will give NA
df[rowSums(is.na(df[,3:10]) * !is.na(df[,4:11])) > 0,]

Or without convert to numeric first:

df[rowSums(is.na(df[,3:10]) * !is.na(sapply(df[4:11], function(x) as.numeric(as.character(x))))) > 0,]

NOTE:

With the method you used to construct your example data, you will end up with all factor columns. Of which I suppose you don't want that.

A possibly correctly formatted example dataset would be:

df <- structure(list(id = c("1", "2", "3", "4", "5"), 
                     dx = c("S06.4", "S06.2", "S06.5", "S06.5", "S06.1"), 
                     dia01 = c(6.7, 5, 7, 7, 6.7),
                     dia02 = c(7, NA, 5.5, 7, NA), 
                     dia03 = c(6.5, 4.9, NA, 7, NA),
                     dia04 = c(7, 7.8, NA, 6.9, NA),
                     dia05 = c(7.2, 9.3, 7.2, 6.8, NA),
                     dia06 = c(NA, 8, 8, 9, NA),
                     dia07 = c(NA, 7.8, 7.6, 6, NA),
                     dia08 = c(6.6, 8, NA, 6.6, NA),
                     dia09 = c(6.7, NA, 6.7, 6.7, NA)),
                .Names = c("id", "dx", "dia01", "dia02", "dia03", "dia04", "dia05", "dia06", "dia07", "dia08", "dia09"),
                row.names = c("a", "b", "c", "d", "e"),
                class = "data.frame")

The proposed method works on that as well.

As noted by @Frank in the comments, it is mostly better to store your data in long format. With:

library(data.table)
setDT(df)[, 3:11 := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 3:11][]
melt(df, id = 1:2)[, if(any(is.na(value) & !is.na(shift(value, type = 'lead')))) .SD, by = .(id, dx)]

you get:

    id    dx variable value
 1:  1 S06.4    dia01   6.7
 2:  1 S06.4    dia02   7.0
 3:  1 S06.4    dia03   6.5
 4:  1 S06.4    dia04   7.0
 5:  1 S06.4    dia05   7.2
 6:  1 S06.4    dia06    NA
 7:  1 S06.4    dia07    NA
 8:  1 S06.4    dia08   6.6
 9:  1 S06.4    dia09   6.7
10:  2 S06.2    dia01   5.0
11:  2 S06.2    dia02    NA
12:  2 S06.2    dia03   4.9
13:  2 S06.2    dia04   7.8
14:  2 S06.2    dia05   9.3
15:  2 S06.2    dia06   8.0
16:  2 S06.2    dia07   7.8
17:  2 S06.2    dia08   8.0
18:  2 S06.2    dia09    NA
19:  3 S06.5    dia01   7.0
20:  3 S06.5    dia02   5.5
21:  3 S06.5    dia03    NA
22:  3 S06.5    dia04    NA
23:  3 S06.5    dia05   7.2
24:  3 S06.5    dia06   8.0
25:  3 S06.5    dia07   7.6
26:  3 S06.5    dia08    NA
27:  3 S06.5    dia09   6.7

Another alternative is:

setDT(df)[, 3:11 := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols = 3:11][]
df[unique(melt(df, id = 1:2)[, .I[is.na(value) & !is.na(shift(value, type = 'lead'))], by = .(id, dx)], by = 'id')[,'id'], on = 'id']

The result of this approach is however still in wide format as presented in the first part of this answer.

Upvotes: 8

Sotos

Reputation: 51592

Another idea is to use apply with margin 1 to go over each row and compare the minimum index of NA with the maximum index of non-NA, i.e.

#convert to numeric first to capture only valid numbers (as in @Jaap's answer)
df[-c(1,2)] <- lapply(df[-c(1,2)], function(x) as.numeric(as.character(x)))

apply(d1, 1, function(i) min(which(is.na(i))) < max(which(!is.na(i))))
#    a     b     c     d     e 
# TRUE  TRUE  TRUE FALSE FALSE 

#or

df[apply(d1, 1, function(i) min(which(is.na(i))) < max(which(!is.na(i)))),]

which gives,

  id    dx dia01 dia02 dia03 dia04 dia05 dia06 dia07 dia08 dia09
a  1 S06.4   6.7     7   6.5     7   7.2  <NA>  <NA>   6.6   6.7
b  2 S06.2     5  <NA>   4.9   7.8   9.3     8   7.8     8  <NA>
c  3 S06.5     7   5.5  <NA>  <NA>   7.2     8   7.6  <NA>   6.7

Upvotes: 3

jogo

Reputation: 12569

Here is a solution with rle():
(I used the data definition from the answer of Jaap)

df <- structure(list(id = c("1", "2", "3", "4", "5"), 
                     dx = c("S06.4", "S06.2", "S06.5", "S06.5", "S06.1"), 
                     dia01 = c(6.7, 5, 7, 7, 6.7),
                     dia02 = c(7, NA, 5.5, 7, NA), 
                     dia03 = c(6.5, 4.9, NA, 7, NA),
                     dia04 = c(7, 7.8, NA, 6.9, NA),
                     dia05 = c(7.2, 9.3, 7.2, 6.8, NA),
                     dia06 = c(NA, 8, 8, 9, NA),
                     dia07 = c(NA, 7.8, 7.6, 6, NA),
                     dia08 = c(6.6, 8, NA, 6.6, NA),
                     dia09 = c(6.7, NA, 6.7, 6.7, NA)),
                .Names = c("id", "dx", "dia01", "dia02", "dia03", "dia04", "dia05", "dia06", "dia07", "dia08", "dia09"),
                row.names = c("a", "b", "c", "d", "e"),
                class = "data.frame")

R <- apply(is.na(df[-(1:2)]), 1, rle)
id.row <- function(r) {
  first.na <- which(r$value)[1]
  if (is.na(first.na)) return(FALSE)
  if (first.na==length(r$value)) return(FALSE)
  return(TRUE)
}
sapply(R, id.row)
#> sapply(R, id.row)
#    a     b     c     d     e 
# TRUE  TRUE  TRUE FALSE FALSE

Upvotes: 2

Find rows with a sequence of consecutive column values

Answers (3)

Related Questions