AndrewGB
AndrewGB

Reputation: 16856

Simultaneously remove the first and last rows of a data frame until reaching a row that does not have an NA

I have a dataframe that contains NA values, and I want to remove some rows that have an NA (i.e., not complete cases). However, I only want to remove rows at the beginning and ending of the dataframe. So, I want to keep any rows that have an NA that are not in the first or last rows of the dataframe. What is the most efficient way to simultaneously remove these rows with NAs without using a row index? This is related to my previous question, but I also want to remove the first rows at the same time. There are other posts that also focus on removing only the first rows, but not both.

Data

df <- structure(list(var1 = 1:15, 
                     var2 = c(3, NA, 3, NA, 2, NA, 3, 4, 2, NA, 4, 2, 45, 2, 1), 
                     var3 = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA), 
                     var4 = c(NA, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, NA)), 
                class = "data.frame", row.names = c(NA, -15L))

Expected Output

So, in this example, I removed rows 1 to 2, and 12 to 15 since they have an NA and row 3 and 11 does not have an NA.

  var1 var2 var3 var4
1    3    3    8    8
2    4   NA    9    9
3    5    2   10   10
4    6   NA   11   11
5    7    3   12   12
6    8    4   13   13
7    9    2   14   14
8   10   NA   15   15
9   11    4   16   16

I know that I could have 2 statements in filter to remove the top and bottom rows (shown below). But I'm wondering if there is a more efficient way to do this with really large datasets (open to any method tidyverse, base R, data.table, etc.).

library(dplyr)

df %>% 
  filter(cumsum(complete.cases(.)) != 0 & 
           rev(cumsum(rev(complete.cases(.)))) != 0)

Upvotes: 1

Views: 853

Answers (7)

hello_friend
hello_friend

Reputation: 5788

Bit late to the party, but base R in a single expression:

df[Reduce(
  function(x, y){
    seq(from = x, to = y)
  },
  range(
    which(
    complete.cases(df)
    )
  )
), ]

Upvotes: 1

AndrewGB
AndrewGB

Reputation: 16856

Benchmark

Here, I create a bigger dataset with 1,000,000 million rows of 3 variables to determine which method is the fastest. *Note: It will take a few seconds to apply the NA values randomly to the 3 columns for the first 100,000 rows and the last 100,000 rows. Essentially, with this example, we want to remove the first 100,000 rows and the last 100,000 rows.

Dataset

set.seed(203)
df <- data.frame(var1 = sample(x = 1:500, size = 1000000, replace = TRUE),
                 var2 = sample(x = 1:500, size = 1000000, replace = TRUE),
                 var3 = sample(x = 1:500, size = 1000000, replace = TRUE))
df[1:100000,] <- plyr::ddply(df[1:100000,], .(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA;x})
df[900000:1000000,] <- plyr::ddply(df[900000:1000000,], .(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA;x})
df[300000:400000,2] <- NA

Output

enter image description here

It looks like @MerijnvanTilborg data.table solution is the fastest, followed by @r2evans data.table version on this sample dataset.

Code

library(tidyverse)
library(data.table)

df1 <- df
dt1 <- as.data.table(df)
dt2 <- as.data.table(df)

bm <- microbenchmark::microbenchmark(baseR_r2evans = {r <- rle(complete.cases(df1)); 
r$values[ -c(1, length(r$values)) ] <- TRUE; df[inverse.rle(r),]},
dplyr_r2evans = {df %>%
    dplyr::mutate(aux = complete.cases(cur_data())) %>%
    dplyr::filter(cumany(aux) & rev(cumany(rev(aux))))},
datatable_r2evans = {dt1[, aux := complete.cases(.SD)
][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]},
valkyr = {na_count <- rowSums(is.na(df)); df %>%
  dplyr::slice(min(which(na_count==0)):max(which(na_count==0)))},
PaulS = {df %>% 
    dplyr::mutate(aux = !complete.cases(.)) %>% 
    dplyr::filter(!cumall(aux)) %>% 
    dplyr::arrange(desc(var1)) %>% 
    dplyr::filter(!cumall(aux)) %>%
    dplyr::arrange(var1) %>% 
    dplyr::select(-aux)},
Chris = {df[(which(rle(rowSums(df))$values != 'NA')[1]):(which(rle(rowSums(df))$values != 'NA'))[[(length(which(rle(rowSums(df))$values != 'NA')))]],]},
AndrewGB = {df %>% 
    dplyr::filter(cumsum(complete.cases(.)) != 0 & 
                    rev(cumsum(rev(complete.cases(.)))) != 0)},
Merijn_baseR = {s <- which(complete.cases(df)); 
df[first(s):last(s), ]},
Merijn_datatable = {dt2[, aux := complete.cases(.SD)][first(which(aux)):last(which(aux))]},
times = 1000
)

Upvotes: 0

Merijn van Tilborg
Merijn van Tilborg

Reputation: 5887

I think we overcomplicated it a bit, most efficient I think is just plain base R

Directly take all your complete cases

s <- which(complete.cases(df))

We surely cannot subset on s, as we want to keep all the "in between" incomplete ones too, we can achieve that by simply subset from the first up till the last index.

df[first(s):last(s), ]

Upvotes: 2

r2evans
r2evans

Reputation: 160447

base R

r <- rle(complete.cases(df))
str(r, vec.len = 9)
# List of 2
#  $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
#  $ values : logi [1:9] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
#  - attr(*, "class")= chr "rle"
r$values[ -c(1, length(r$values)) ] <- TRUE
str(r, vec.len = 9)
# List of 2
#  $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
#  $ values : logi [1:9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
#  - attr(*, "class")= chr "rle"
df[inverse.rle(r),]
#    var1 var2 var3 var4
# 3     3    3    8    8
# 4     4   NA    9    9
# 5     5    2   10   10
# 6     6   NA   11   11
# 7     7    3   12   12
# 8     8    4   13   13
# 9     9    2   14   14
# 10   10   NA   15   15
# 11   11    4   16   16

dplyr

For your question of efficiency, you can adapt the rle solution to dplyr as well (that should be trivial), but I see no reason why the use of complete.cases and cumany/rev would be a problem. You can improve on your attempt by not calculating complete.cases(.) twice as you're doing, storing it in an interim column.

library(dplyr)
df %>%
  mutate(aux = complete.cases(cur_data())) %>%
  filter(cumany(aux) & rev(cumany(rev(aux))))
#   var1 var2 var3 var4   aux
# 1    3    3    8    8  TRUE
# 2    4   NA    9    9 FALSE
# 3    5    2   10   10  TRUE
# 4    6   NA   11   11 FALSE
# 5    7    3   12   12  TRUE
# 6    8    4   13   13  TRUE
# 7    9    2   14   14  TRUE
# 8   10   NA   15   15 FALSE
# 9   11    4   16   16  TRUE

data.table

(Just an adaptation of the dplyr version.)

library(data.table)
setDT(df)
df[, aux := complete.cases(.SD)
  ][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]
#     var1  var2  var3  var4    aux
#    <int> <num> <int> <int> <lgcl>
# 1:     3     3     8     8   TRUE
# 2:     4    NA     9     9  FALSE
# 3:     5     2    10    10   TRUE
# 4:     6    NA    11    11  FALSE
# 5:     7     3    12    12   TRUE
# 6:     8     4    13    13   TRUE
# 7:     9     2    14    14   TRUE
# 8:    10    NA    15    15  FALSE
# 9:    11     4    16    16   TRUE

Upvotes: 2

PaulS
PaulS

Reputation: 25333

Another possible solution (thanks, @r2evans, for suggesting complete.cases):

library(dplyr)

df %>% 
  mutate(aux = !complete.cases(.)) %>% 
  filter(!cumall(aux)) %>% 
  arrange(desc(var1)) %>% 
  filter(!cumall(aux)) %>%
  arrange(var1) %>% 
  select(-aux)

#>   var1 var2 var3 var4
#> 1    3    3    8    8
#> 2    4   NA    9    9
#> 3    5    2   10   10
#> 4    6   NA   11   11
#> 5    7    3   12   12
#> 6    8    4   13   13
#> 7    9    2   14   14
#> 8   10   NA   15   15
#> 9   11    4   16   16

Upvotes: 1

Chris
Chris

Reputation: 2286

continuing a rle love fest:

(which(rle(rowSums(df_NA))$values != 'NA')[1]):dplyr::last(which(rle(rowSums(df_NA))$values != 'NA'))
[1]  3  4  5  6  7  8  9 10 11

or, dispensing with dplyr

(which(rle(rowSums(df_NA))$values != 'NA')[1]):(which(rle(rowSums(df_NA))$values != 'NA'))[[(length(which(rle(rowSums(df_NA))$values != 'NA')))]]
[1]  3  4  5  6  7  8  9 10 11

Upvotes: 1

Valkyr
Valkyr

Reputation: 431

I would do

na_count <- rowSums(is.na(df))
df <- df %>%
  slice(min(which(na_count==0)):max(which(na_count==0)))

Output

> df
  var1 var2 var3 var4
1    3    3    8    8
2    4   NA    9    9
3    5    2   10   10
4    6   NA   11   11
5    7    3   12   12
6    8    4   13   13
7    9    2   14   14
8   10   NA   15   15
9   11    4   16   16

Upvotes: 2

Related Questions