Carrol

Reputation: 1285

Get dataframe rows matching conditions with uneven rows length

Let there be a data frame whose rows have uneven lengths and an unknown number of columns, i.e. each row may hold a different number of values, but all NA values are always at the end of the row. There are also three values: start, penultimate and last.

Problem: how to (elegantly, without nested loops) find all rows of the data frame whose first value is start and whose last two non-NA values are penultimate and last, in that order?

Example: For the following dataframe and values:

df <- structure(list(V1 = c("a", "a", "a", "a", "b"), V2 = c("b", "n", "t", "o", "l"), V3 = c("c", "m", "h", "j", "p"), V4 = c("d", "c", "j", "", "e"), V5 = c("", "d", "", "", "")), 
.Names = c("V1", "V2", "V3", "V4", "V5"), 
row.names = c(NA, 5L), class = "data.frame")
df[df == ""] <- NA

start <- "a"
penultimate <- "c"
last <- "d"

The desired output would be the following subset:

  V1 V2 V3 V4   V5
1  a  b  c  d <NA>
2  a  n  m  c    d

Upvotes: 2

Views: 206

Answers (3)

CPak

Reputation: 13581

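You can use regular expressions to your advantage here: collapse the non-NA values of each row into a single string and match it against a pattern built from start, penultimate and last.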

pattern <- paste0("^", start, ".*", penultimate, last, "$")
# "^a.*cd$"
index <- grepl(pattern, apply(df, 1, function(i) paste(i[!is.na(i)], collapse="")))
# [1]  TRUE  TRUE FALSE FALSE FALSE
df[index,]
#   V1 V2 V3 V4   V5
# 1  a  b  c  d <NA>
# 2  a  n  m  c    d
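
One caveat: collapsing with an empty separator is only unambiguous when every cell holds a single character, and values containing regex metacharacters would need escaping. A minimal sketch of a variant that avoids both issues is shown below. It joins the cells with a separator and uses the base R functions startsWith() and endsWith() instead of a pattern. The separator "\x1f" is an assumption for illustration and is presumed never to occur in the data; the example data frame is rebuilt here so that the block is self-contained.

```r
# Example data from the question, with NA already in place of ""
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

# Assumption: this separator never appears inside any cell
sep <- "\x1f"

# Join the non-NA cells of each row, wrapping the result in separators
# so that every value is delimited on both sides
joined <- apply(df, 1, function(r) {
  paste0(sep, paste(r[!is.na(r)], collapse = sep), sep)
})

# Literal (non-regex) prefix and suffix checks
index <- startsWith(joined, paste0(sep, start, sep)) &
  endsWith(joined, paste0(sep, penultimate, sep, last, sep))

result <- df[index, ]
```

Because the comparison is literal, values such as "a.b" or "c+" are matched as-is rather than interpreted as patterns.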

Upvotes: 1

C. Braun

Reputation: 5201

Here's one way using base R:

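output <- apply(df, 1, function(row) {
    # position of the last non-NA value in this row
    index_last <- max(which(!is.na(row)))
    # scalar && short-circuits; require two values so index_last - 1 is valid
    if (index_last >= 2 && row[1] == start &&
        row[index_last - 1] == penultimate && row[index_last] == last) {
        return(row)
    }
    return(NULL)
})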

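This gives a list holding the matching rows (and NULL for the others), which we can rbind back together. Note that the result is a character matrix rather than a data.frame; wrap it in as.data.frame() if a data frame is needed: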

> do.call(rbind, output)
  V1  V2  V3  V4  V5 
1 "a" "b" "c" "d" NA 
2 "a" "n" "m" "c" "d"

Upvotes: 1

Ronak Shah

Reputation: 389125

I managed to solve it using apply with MARGIN = 1; however, I have doubts about its efficiency.

df[apply(df, 1, function(x) {
    temp = x[!is.na(x)]
    temp[1] == start & tail(temp, 1) == last & tail(temp, 2)[1] == penultimate
}), ]

#  V1 V2 V3 V4   V5
#1  a  b  c  d <NA>
#2  a  n  m  c    d

For each row, we first remove all the NA elements, then check the three conditions (start, last and penultimate), and finally subset the rows using the resulting logical vector.
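
If efficiency is a concern, the row-wise apply can probably be avoided. The following is only a sketch, and it relies on the question's guarantee that all NA values sit at the end of each row: count the non-NA cells per row with rowSums(), then pick the last and penultimate values with matrix indexing. The names m, n, rows, ok and keep are introduced here for illustration, and the example data frame is rebuilt so that the block is self-contained.

```r
# Example data from the question, with NA already in place of ""
df <- data.frame(
  V1 = c("a", "a", "a", "a", "b"),
  V2 = c("b", "n", "t", "o", "l"),
  V3 = c("c", "m", "h", "j", "p"),
  V4 = c("d", "c", "j", NA, "e"),
  V5 = c(NA, "d", NA, NA, NA),
  stringsAsFactors = FALSE
)
start <- "a"
penultimate <- "c"
last <- "d"

m <- as.matrix(df)
# Number of non-NA cells per row; since NAs are trailing, this is also
# the column position of the last value in each row
n <- rowSums(!is.na(m))
rows <- seq_len(nrow(m))

# Only rows with at least two values have a penultimate element
ok <- n >= 2

keep <- logical(nrow(m))
# A two-column index matrix (row, column) selects one cell per row
keep[ok] <- m[ok, 1] == start &
  m[cbind(rows[ok], n[ok] - 1)] == penultimate &
  m[cbind(rows[ok], n[ok])] == last

result <- df[keep, ]
```

Rows with fewer than two values are left as FALSE, so they are never returned.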

Upvotes: 2
