John
John

Reputation: 1471

Subset of rows containing NA (missing) values in a chosen column of a data frame

We have a data frame from a CSV file. The data frame DF has columns that contain observed values and a column (VaR2) that contains the date at which a measurement has been taken. If the date was not recorded, the CSV file contains the value NA, for missing data.

Var1  Var2 
10    2010/01/01
20    NA
30    2010/03/01

We would like to use the subset command to define a new data frame new_DF such that it only contains rows that have an NA' value from the column (VaR2). In the example given, only Row 2 will be contained in the new DF.

The command

new_DF<-subset(DF, DF$Var2=="NA") 

does not work, the resulting data frame has no row entries.

If in the original CSV file the Value NA are exchanged with NULL, the same command produces the desired result:

new_DF <- subset(DF, DF$Var2=="NULL")

How can I get this method working, if for the character string the value NA is provided in the original CSV file?

Upvotes: 135

Views: 442200

Answers (10)

jpsmith
jpsmith

Reputation: 17174

As another approach using the relatively recent hablar package also makes this easy with the find_na() function, and can be used as-is or within the dplyr pipe:

Some sample data

myCars <- mtcars[1:5,1:3]
myCars[c(1,5), 1] <- NA
myCars[c(3,4), 2] <- NA

#                    mpg cyl disp
# Mazda RX4           NA   6  160
# Mazda RX4 Wag     21.0   6  160
# Datsun 710        22.8  NA  108
# Hornet 4 Drive    21.4  NA  258
# Hornet Sportabout   NA   8  360

Code:

# native
hablar::find_na(myCars, cyl)

# dplyr 
myCars %>% 
  hablar::find_na(cyl)

Both return:

#                 mpg cyl disp
# Datsun 710     22.8  NA  108
# Hornet 4 Drive 21.4  NA  258

Note that hthis can be extended: hablar::find_na(myCars, c(cyl, mpg)) will return NA values from both columns, and hablar::find_na(myCars) will return all rows with NA in any column.

Upvotes: 0

Shriram Bhosle
Shriram Bhosle

Reputation: 56

This will give the tally of number of NA (TRUE) and FALSE values in the table:

table(is.na(df))

Upvotes: 0

castel32
castel32

Reputation: 1

In addition to the main answer, if you want all rows with one or more NAs:

DF_rows_with_NAs = DF[rowSums(is.na(DF)) > 0, ]

DF_rows_with_NAs

Verification :

sum(is.na(DF_rows_with_NAs)) 

This should equal the total NAs of the whole DF.

If you want all columns with NAs:

DF_columns_with_NAs=DF[ ,colSums(is.na(DF)) > 0]

DF_columns_with_NAs

Verification:

sum(is.na(ts_columns_with_NAs))

This should equal the total NAs of the whole DF.

Upvotes: 0

Martin Gal
Martin Gal

Reputation: 16978

Since dplyrs filter_all has been superseded

Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing verb.

and the usage of across() in filter() is deprecated, Ronak Pol's answer needs a small update. To find all rows with an NA anywhere, we could use

library(dplyr)

DF %>% 
  filter(if_any(everything(), is.na))

to get

# A tibble: 1 x 2
   Var1 Var2  
  <dbl> <date>
1    20 NA   

Upvotes: 8

Ronak Pol
Ronak Pol

Reputation: 381

new_data <- data %>% filter_all(any_vars(is.na(.))) 

This should create a new data frame (new_data) with only the missing values in it.

Works best to keep a track of values that you might later drop because they had some columns with missing observations (NA).

Upvotes: 28

drhnis
drhnis

Reputation: 113

Try changing this:

new_DF<-dplyr::filter(DF,is.na(Var2)) 

Upvotes: 7

user3226167
user3226167

Reputation: 3439

complete.cases gives TRUE when all values in a row are not NA

DF[!complete.cases(DF), ]

Upvotes: 67

jstar
jstar

Reputation: 877

Prints all the rows with NA data:

tmp <- data.frame(c(1,2,3),c(4,NA,5));
tmp[round(which(is.na(tmp))/ncol(tmp)),]

Upvotes: -1

Joris Meys
Joris Meys

Reputation: 108523

Never use =='NA' to test for missing values. Use is.na() instead. This should do it:

new_DF <- DF[rowSums(is.na(DF)) > 0,]

or in case you want to check a particular column, you can also use

new_DF <- DF[is.na(DF$Var),]

In case you have NA character values, first run

Df[Df=='NA'] <- NA

to replace them with missing values.

Upvotes: 193

maressyl
maressyl

Reputation: 983

NA is a special value in R, do not mix up the NA value with the "NA" string. Depending on the way the data was imported, your "NA" and "NULL" cells may be of various type (the default behavior is to convert "NA" strings to NA values, and let "NULL" strings as is).

If using read.table() or read.csv(), you should consider the "na.strings" argument to do clean data import, and always work with real R NA values.

An example, working in both cases "NULL" and "NA" cells :

DF <- read.csv("file.csv", na.strings=c("NA", "NULL"))
new_DF <- subset(DF, is.na(DF$Var2))

Upvotes: 42

Related Questions