thebonafortuna
thebonafortuna

Reputation: 103

Removing "NA" values in R (for some reason na.omit and complete.cases only work on one variable...)

The first part of this assignment is to import data from an external http site; the data contains eight variables with 1,339 observations. Four of those variables (age, height, weight, igf1) contain NA values within them (NOTE: the other variables may also have NA values, but I'm not concerned with them). I need to eliminate the NA values in those four variables: this is where I'm struggling.

Here is what I have so far:

#imports dataset from internet
importData <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", sep=',', header=T)
#inspects the data:
str(importData)

Basically, I want to remove ALL NA values in age, height, weight, and igf1. I'll know I'm successful when I have 858 observations remaining.

Three of the variables (height, weight, igf1) contain FACTOR type information. One of the variables (age) contains numeric information. I have been unable to successfully implement complete.cases and/or na.omit across them: those functions only seem to work on $age, where they eliminate the five NA values (but don't touch the other variables)

I need help cutting out the NA values in the remaining variables. Again, when I'm done, I should have 858 observations.

Upvotes: 2

Views: 37724

Answers (4)

fred
fred

Reputation: 10060

If you are reading data from file, use the read.table option bellow:

read.table(...,quote="")

If you conditionally created NAs, like:

df[condition] <- NA

make sure you never use it quoted ("NA").

Upvotes: 0

Rich Scriven
Rich Scriven

Reputation: 99331

Why not make things a lot less stressful and read the .txt version of the data instead? I found it in the data list on the same page as the .csv. It isn't spaced abnormally like the .csv file and there is no fiddling around with read.csv arguments (which I did for quite a while before finding the other file).

con <- "http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.txt"
dat <- read.table(con, header = TRUE)
sapply(dat, class)
#       age    height  menarche       sex      igf1    tanner   testvol    weight 
# "numeric" "numeric" "integer" "numeric" "numeric" "integer" "integer" "numeric" 
cols <- c("age", "height", "weight", "igf1")
resultSet <- dat[complete.cases(dat[cols]), ]
dim(resultSet)
# [1] 858   8

Upvotes: 6

Hugh
Hugh

Reputation: 16090

Read in your data using as.is=TRUE and na.strings = c(NA, "NA", " NA")). Your file has spaces between fields which are interpreted as part of the field.

data <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", 
        sep=",", header=TRUE, as.is=TRUE, na.strings=c(NA, "NA", " NA"))

If you want to remove rows where any of the selected columns are NA use complete.cases:

new.data <- data[complete.cases(data[ ,c("age", "height", "weight", "igf1")]), ]
nrow(new.data)
# [1] 858

If you want to remove only the rows where all of the selected columns are NA, use rowSums:

new.data <- data[rowSums(is.na(data[ ,c("age", "height", "weight", "igf1")])) < 4, ]
nrow(new.data)
# [1] 1339   # there aren't any such rows :)

This isolated the four columns, tests whether they are NA, and for each row calculates how many are NA, returning only the ones that have fewer than 4 NAs in each row.

Upvotes: 4

Juli&#225;n Urbano
Juli&#225;n Urbano

Reputation: 8488

The problem is that your data file encloses all values in quotes, like this:

"age", "height", "menarche", "sex", "igf1", "tanner", "testvol", "weight"
"NA", "NA", "NA", "NA", "90.0", "NA", "NA", "NA"
"NA", "NA", "NA", "NA", "88.0", "NA", "NA", "NA"

When R reads that, it takes all values as strings, which by default represent as factors. The NAs are just taken as the label for one level of those factors. You can see this with str:

> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : Factor w/ 600 levels " 110.8"," 111.5",..: 600 600 600 600 600 600 600 600 600 600 ...
 $ menarche: Factor w/ 3 levels " 1"," 2"," NA": 3 3 3 3 3 3 3 3 3 3 ...
 $ sex     : Factor w/ 3 levels " 1.00"," 2.00",..: 3 3 3 3 3 1 1 1 1 1 ...
 $ igf1    : Factor w/ 501 levels " 100.0"," 101.0",..: 490 487 53 55 23 2 498 6 10 474 ...
 $ tanner  : Factor w/ 6 levels " 1"," 2"," 3",..: 6 6 6 6 6 1 1 1 1 1 ...
 $ testvol : Factor w/ 26 levels " 1"," 10"," 11",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ weight  : Factor w/ 518 levels " 14.1"," 17.9",..: 518 518 518 518 518 518 518 518 518 518 ...

So read your data without making factors, which reads all as numeric:

> importData <- read.csv("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv",
                         stringsAsFactors=F, na.strings=c(NA,"NA"," NA"))
> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  NA NA NA NA NA 1 1 1 1 1 ...
 $ igf1    : num  90 88 164 166 131 101 97 106 111 79 ...
 $ tanner  : int  NA NA NA NA NA 1 1 1 1 1 ...
 $ testvol : int  NA NA NA NA NA NA NA NA NA NA ...
 $ weight  : num  NA NA NA NA NA NA NA NA NA NA ...

and now remove NAs:

> data <- importData[complete.cases(importData[c("age","height","weight","igf1")]),]
> str(data)
'data.frame':   858 obs. of  8 variables:
 $ age     : num  6 6.08 6.26 6.4 6.42 6.43 6.61 6.63 6.7 6.72 ...
 $ height  : num  112 117 120 116 116 ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ igf1    : num  98 242 196 179 126 142 236 148 174 136 ...
 $ tanner  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ testvol : int  1 1 1 1 1 1 1 2 1 1 ...
 $ weight  : num  19.1 21.7 24.7 19.6 20.6 20.2 28 21.6 26.1 22.6 ...

Upvotes: 7

Related Questions