Reputation: 16856
I have a dataframe that contains NA values, and I want to remove some rows that have an NA (i.e., not complete cases). However, I only want to remove rows at the beginning and end of the dataframe, so I want to keep any rows with an NA that are not in the first or last rows. What is the most efficient way to remove these leading and trailing NA rows at the same time, without using a row index? This is related to my previous question, but here I also want to remove the first rows in the same step. There are other posts that focus on removing only the first rows, but not both.
Data
df <- structure(list(var1 = 1:15,
                     var2 = c(3, NA, 3, NA, 2, NA, 3, 4, 2, NA, 4, 2, 45, 2, 1),
                     var3 = c(6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, NA, NA, NA, NA),
                     var4 = c(NA, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, NA)),
                class = "data.frame", row.names = c(NA, -15L))
Expected Output
So, in this example, I removed rows 1 to 2 and 12 to 15 since they have an NA, and rows 3 and 11 do not have an NA.
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
I know that I could have 2 statements in filter to remove the top and bottom rows (shown below). But I'm wondering if there is a more efficient way to do this with really large datasets (open to any method: tidyverse, base R, data.table, etc.).
library(dplyr)
df %>%
  filter(cumsum(complete.cases(.)) != 0 &
           rev(cumsum(rev(complete.cases(.)))) != 0)
Upvotes: 1
Views: 853
Reputation: 5788
Bit late to the party, but base R in a single expression:
df[Reduce(
  function(x, y) {
    seq(from = x, to = y)
  },
  range(
    which(
      complete.cases(df)
    )
  )
), ]
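For clarity, Reduce() here just applies seq() to the two endpoints returned by range(), so the expression is equivalent to this spelled-out sketch:
idx <- which(complete.cases(df))           # row numbers of the complete cases
df[seq(from = min(idx), to = max(idx)), ]  # keep the run from the first to the last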
Upvotes: 1
Reputation: 16856
Benchmark
Here, I create a bigger dataset with 1,000,000 rows of 3 variables to determine which method is the fastest. Note: it will take a few seconds to randomly apply the NA values to the 3 columns for the first 100,000 rows and the last 100,000 rows. Essentially, with this example, we want to remove the first 100,000 rows and the last 100,000 rows.
Dataset
set.seed(203)
df <- data.frame(var1 = sample(x = 1:500, size = 1000000, replace = TRUE),
                 var2 = sample(x = 1:500, size = 1000000, replace = TRUE),
                 var3 = sample(x = 1:500, size = 1000000, replace = TRUE))
df[1:100000, ] <- plyr::ddply(df[1:100000, ], plyr::.(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA; x})
df[900000:1000000, ] <- plyr::ddply(df[900000:1000000, ], plyr::.(var1, var2, var3), function(x) {x[sample(x = 1:3, size = 1, replace = TRUE)] <- NA; x})
df[300000:400000,2] <- NA
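If the ddply calls above are too slow, a rough alternative sketch for injecting those NAs is a matrix-indexed assignment per block (this blanks one randomly chosen column per affected row, but it will not reproduce the exact NA pattern of the ddply version):
head_rows <- 1:100000
tail_rows <- 900000:1000000
df[cbind(head_rows, sample(1:3, length(head_rows), replace = TRUE))] <- NA  # one NA per row, random column
df[cbind(tail_rows, sample(1:3, length(tail_rows), replace = TRUE))] <- NA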
Output
It looks like @MerijnvanTilborg's data.table solution is the fastest, followed by @r2evans' data.table version, on this sample dataset.
Code
library(tidyverse)
library(data.table)
df1 <- df
dt1 <- as.data.table(df)
dt2 <- as.data.table(df)
bm <- microbenchmark::microbenchmark(
  baseR_r2evans = {r <- rle(complete.cases(df1));
    r$values[-c(1, length(r$values))] <- TRUE; df1[inverse.rle(r), ]},
  dplyr_r2evans = {df %>%
    dplyr::mutate(aux = complete.cases(cur_data())) %>%
    dplyr::filter(cumany(aux) & rev(cumany(rev(aux))))},
  datatable_r2evans = {dt1[, aux := complete.cases(.SD)
    ][cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]},
  valkyr = {na_count <- rowSums(is.na(df)); df %>%
    dplyr::slice(min(which(na_count == 0)):max(which(na_count == 0)))},
  PaulS = {df %>%
    dplyr::mutate(aux = !complete.cases(.)) %>%
    dplyr::filter(!cumall(aux)) %>%
    dplyr::arrange(desc(var1)) %>%
    dplyr::filter(!cumall(aux)) %>%
    dplyr::arrange(var1) %>%
    dplyr::select(-aux)},
  Chris = {df[(which(rle(rowSums(df))$values != 'NA')[1]):(which(rle(rowSums(df))$values != 'NA'))[[(length(which(rle(rowSums(df))$values != 'NA')))]], ]},
  AndrewGB = {df %>%
    dplyr::filter(cumsum(complete.cases(.)) != 0 &
                    rev(cumsum(rev(complete.cases(.)))) != 0)},
  Merijn_baseR = {s <- which(complete.cases(df));
    df[first(s):last(s), ]},
  Merijn_datatable = {dt2[, aux := complete.cases(.SD)][first(which(aux)):last(which(aux))]},
  times = 1000
)
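To inspect the timings, the resulting microbenchmark object can be printed or plotted, for example:
print(bm, unit = "ms")   # summary table in milliseconds
ggplot2::autoplot(bm)    # violin plot of the timing distributions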
Upvotes: 0
Reputation: 5887
I think we overcomplicated it a bit; the most efficient approach, I think, is just plain base R.
Directly take all your complete cases:
s <- which(complete.cases(df))
We surely cannot subset on s directly, as we want to keep all the "in between" incomplete rows too; we can achieve that by simply subsetting from the first up to the last index of s.
df[first(s):last(s), ]
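Note that first() and last() come from dplyr/data.table rather than base R; a strictly base R equivalent of the same idea is:
df[s[1]:s[length(s)], ]   # or equivalently df[min(s):max(s), ]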
Upvotes: 2
Reputation: 160447
r <- rle(complete.cases(df))   # runs of complete (TRUE) / incomplete (FALSE) rows
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# - attr(*, "class")= chr "rle"
r$values[ -c(1, length(r$values)) ] <- TRUE   # mark all interior runs as keep; only leading/trailing incomplete runs stay FALSE
str(r, vec.len = 9)
# List of 2
# $ lengths: int [1:9] 2 1 1 1 1 3 1 1 4
# $ values : logi [1:9] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
# - attr(*, "class")= chr "rle"
df[inverse.rle(r),]
# var1 var2 var3 var4
# 3 3 3 8 8
# 4 4 NA 9 9
# 5 5 2 10 10
# 6 6 NA 11 11
# 7 7 3 12 12
# 8 8 4 13 13
# 9 9 2 14 14
# 10 10 NA 15 15
# 11 11 4 16 16
For your question of efficiency, you can adapt the rle solution to dplyr as well (that should be trivial), but I see no reason why the use of complete.cases and cumany/rev would be a problem. You can improve on your attempt by not calculating complete.cases(.) twice, as you are doing, and instead storing it in an interim column.
library(dplyr)
df %>%
  mutate(aux = complete.cases(cur_data())) %>%
  filter(cumany(aux) & rev(cumany(rev(aux))))
# var1 var2 var3 var4 aux
# 1 3 3 8 8 TRUE
# 2 4 NA 9 9 FALSE
# 3 5 2 10 10 TRUE
# 4 6 NA 11 11 FALSE
# 5 7 3 12 12 TRUE
# 6 8 4 13 13 TRUE
# 7 9 2 14 14 TRUE
# 8 10 NA 15 15 FALSE
# 9 11 4 16 16 TRUE
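If the interim column isn't wanted in the result, it can be dropped at the end, e.g.:
df %>%
  mutate(aux = complete.cases(cur_data())) %>%
  filter(cumany(aux) & rev(cumany(rev(aux)))) %>%
  select(-aux)   # drop the helper column from the output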
(The data.table version below is just an adaptation of the dplyr version.)
library(data.table)
setDT(df)
df[, aux := complete.cases(.SD)
][ cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), ]
# var1 var2 var3 var4 aux
# <int> <num> <int> <int> <lgcl>
# 1: 3 3 8 8 TRUE
# 2: 4 NA 9 9 FALSE
# 3: 5 2 10 10 TRUE
# 4: 6 NA 11 11 FALSE
# 5: 7 3 12 12 TRUE
# 6: 8 4 13 13 TRUE
# 7: 9 2 14 14 TRUE
# 8: 10 NA 15 15 FALSE
# 9: 11 4 16 16 TRUE
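As with the dplyr version, the helper column can be excluded from the result, e.g.:
df[cumsum(aux) > 0 & rev(cumsum(rev(aux)) > 0), !"aux"]   # same filter, without the aux column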
Upvotes: 2
Reputation: 25333
Another possible solution (thanks, @r2evans, for suggesting complete.cases):
library(dplyr)
df %>%
  mutate(aux = !complete.cases(.)) %>%
  filter(!cumall(aux)) %>%
  arrange(desc(var1)) %>%
  filter(!cumall(aux)) %>%
  arrange(var1) %>%
  select(-aux)
#> var1 var2 var3 var4
#> 1 3 3 8 8
#> 2 4 NA 9 9
#> 3 5 2 10 10
#> 4 6 NA 11 11
#> 5 7 3 12 12
#> 6 8 4 13 13
#> 7 9 2 14 14
#> 8 10 NA 15 15
#> 9 11 4 16 16
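Note that this relies on var1 being an increasing row identifier to restore the original order; with arbitrary data, a small variant of the same idea could capture the row order first (rn is a hypothetical helper column added for illustration):
df %>%
  mutate(rn = row_number(), aux = !complete.cases(.)) %>%
  filter(!cumall(aux)) %>%     # drop leading incomplete rows
  arrange(desc(rn)) %>%
  filter(!cumall(aux)) %>%     # drop trailing incomplete rows
  arrange(rn) %>%
  select(-aux, -rn)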
Upvotes: 1
Reputation: 2286
Continuing an rle love fest (df_NA here is the example df from the question):
(which(rle(rowSums(df_NA))$values != 'NA')[1]):dplyr::last(which(rle(rowSums(df_NA))$values != 'NA'))
[1] 3 4 5 6 7 8 9 10 11
or, dispensing with dplyr:
(which(rle(rowSums(df_NA))$values != 'NA')[1]):(which(rle(rowSums(df_NA))$values != 'NA'))[[(length(which(rle(rowSums(df_NA))$values != 'NA')))]]
[1] 3 4 5 6 7 8 9 10 11
Upvotes: 1
Reputation: 431
I would do
na_count <- rowSums(is.na(df))
df <- df %>%
  slice(min(which(na_count == 0)):max(which(na_count == 0)))
Output
> df
var1 var2 var3 var4
1 3 3 8 8
2 4 NA 9 9
3 5 2 10 10
4 6 NA 11 11
5 7 3 12 12
6 8 4 13 13
7 9 2 14 14
8 10 NA 15 15
9 11 4 16 16
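Since na_count == 0 is just complete.cases(df), an equivalent way to write this is:
cc <- which(complete.cases(df))
df %>% slice(min(cc):max(cc))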
Upvotes: 2