Reputation: 1291
I have a dataframe like this
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
...
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31
I want to drop only the rows like 29299 and 29300, which contain 5 NA
s from temp to pressure (these are consecutive columns), and keep the rows like 16983 and 26477.
desired result:
dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
In other words, the problem is how to remove only the rows where there are at least 5 NA
s in a row.
apparently this is not the right way to do it:
df <- df[!is.na(df$temp:df$pressure),]
Upvotes: 2
Views: 90
Reputation: 3414
Updated based on Yacine Jajji comment.
You can use standard filter
function in dplyr
package. You set the number of columns which should be never NA
. In your case there are 2
: dep_delay
and date
. Then calculate amount of NA
in each row, if the number equals 5
the row will be filtered out. See the code below:
df <- read.table( text = "dep_delay temp humid wind_dir precip pressure date
16983 3 68.00 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
29299 -1 NA NA NA NA NA 2013-12-31
29300 33 NA NA NA NA NA 2013-12-31")
library(dplyr)
cols_to_remove <- c("temp", "humid", "wind_dir", "precip", "pressure")
df[rowSums(is.na(df[, cols_to_remove])) !=
ncol(df[, cols_to_remove]), ]
Output:
dep_delay temp humid wind_dir precip pressure date
16983 3 68 53.06 NA 0 1020.8 2013-05-07
26477 42 NA 64.93 360 0 NA 2013-03-07
Upvotes: 2