Reputation: 103
The first part of this assignment is to import data from an external http site; the data contains eight variables with 1,339 observations. Four of those variables (age, height, weight, igf1) contain NA values within them (NOTE: the other variables may also have NA values, but I'm not concerned with them). I need to eliminate the NA values in those four variables: this is where I'm struggling.
Here is what I have so far:
#imports dataset from internet
importData <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", sep=',', header=T)
#inspects the data:
str(importData)
Basically, I want to remove ALL NA values in age, height, weight, and igf1. I'll know I'm successful when I have 858 observations remaining.
Three of the variables (height, weight, igf1) contain FACTOR type information. One of the variables (age) contains numeric information. I have been unable to successfully implement complete.cases and/or na.omit across them: those functions only seem to work on $age, where they eliminate the five NA values (but don't touch the other variables)
I need help cutting out the NA values in the remaining variables. Again, when I'm done, I should have 858 observations.
Upvotes: 2
Views: 37724
Reputation: 10060
If you are reading data from file, use the read.table option bellow:
read.table(...,quote="")
If you conditionally created NAs, like:
df[condition] <- NA
make sure you never use it quoted ("NA").
Upvotes: 0
Reputation: 99331
Why not make things a lot less stressful and read the .txt
version of the data instead? I found it in the data list on the same page as the .csv
. It isn't spaced abnormally like the .csv
file and there is no fiddling around with read.csv
arguments (which I did for quite a while before finding the other file).
con <- "http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.txt"
dat <- read.table(con, header = TRUE)
sapply(dat, class)
# age height menarche sex igf1 tanner testvol weight
# "numeric" "numeric" "integer" "numeric" "numeric" "integer" "integer" "numeric"
cols <- c("age", "height", "weight", "igf1")
resultSet <- dat[complete.cases(dat[cols]), ]
dim(resultSet)
# [1] 858 8
Upvotes: 6
Reputation: 16090
Read in your data using as.is=TRUE
and na.strings = c(NA, "NA", " NA"))
. Your file has spaces between fields which are interpreted as part of the field.
data <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv",
sep=",", header=TRUE, as.is=TRUE, na.strings=c(NA, "NA", " NA"))
If you want to remove rows where any of the selected columns are NA
use complete.cases
:
new.data <- data[complete.cases(data[ ,c("age", "height", "weight", "igf1")]), ]
nrow(new.data)
# [1] 858
If you want to remove only the rows where all of the selected columns are NA
, use rowSums
:
new.data <- data[rowSums(is.na(data[ ,c("age", "height", "weight", "igf1")])) < 4, ]
nrow(new.data)
# [1] 1339 # there aren't any such rows :)
This isolated the four columns, tests whether they are NA
, and for each row calculates how many are NA
, returning only the ones that have fewer than 4 NA
s in each row.
Upvotes: 4
Reputation: 8488
The problem is that your data file encloses all values in quotes, like this:
"age", "height", "menarche", "sex", "igf1", "tanner", "testvol", "weight"
"NA", "NA", "NA", "NA", "90.0", "NA", "NA", "NA"
"NA", "NA", "NA", "NA", "88.0", "NA", "NA", "NA"
When R reads that, it takes all values as strings, which by default represent as factors. The NA
s are just taken as the label for one level of those factors. You can see this with str
:
> str(importData)
'data.frame': 1339 obs. of 8 variables:
$ age : num NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
$ height : Factor w/ 600 levels " 110.8"," 111.5",..: 600 600 600 600 600 600 600 600 600 600 ...
$ menarche: Factor w/ 3 levels " 1"," 2"," NA": 3 3 3 3 3 3 3 3 3 3 ...
$ sex : Factor w/ 3 levels " 1.00"," 2.00",..: 3 3 3 3 3 1 1 1 1 1 ...
$ igf1 : Factor w/ 501 levels " 100.0"," 101.0",..: 490 487 53 55 23 2 498 6 10 474 ...
$ tanner : Factor w/ 6 levels " 1"," 2"," 3",..: 6 6 6 6 6 1 1 1 1 1 ...
$ testvol : Factor w/ 26 levels " 1"," 10"," 11",..: 26 26 26 26 26 26 26 26 26 26 ...
$ weight : Factor w/ 518 levels " 14.1"," 17.9",..: 518 518 518 518 518 518 518 518 518 518 ...
So read your data without making factors, which reads all as numeric:
> importData <- read.csv("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv",
stringsAsFactors=F, na.strings=c(NA,"NA"," NA"))
> str(importData)
'data.frame': 1339 obs. of 8 variables:
$ age : num NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
$ height : num NA NA NA NA NA NA NA NA NA NA ...
$ menarche: int NA NA NA NA NA NA NA NA NA NA ...
$ sex : num NA NA NA NA NA 1 1 1 1 1 ...
$ igf1 : num 90 88 164 166 131 101 97 106 111 79 ...
$ tanner : int NA NA NA NA NA 1 1 1 1 1 ...
$ testvol : int NA NA NA NA NA NA NA NA NA NA ...
$ weight : num NA NA NA NA NA NA NA NA NA NA ...
and now remove NA
s:
> data <- importData[complete.cases(importData[c("age","height","weight","igf1")]),]
> str(data)
'data.frame': 858 obs. of 8 variables:
$ age : num 6 6.08 6.26 6.4 6.42 6.43 6.61 6.63 6.7 6.72 ...
$ height : num 112 117 120 116 116 ...
$ menarche: int NA NA NA NA NA NA NA NA NA NA ...
$ sex : num 1 1 1 1 1 1 1 1 1 1 ...
$ igf1 : num 98 242 196 179 126 142 236 148 174 136 ...
$ tanner : int 1 1 1 1 1 1 1 1 1 1 ...
$ testvol : int 1 1 1 1 1 1 1 2 1 1 ...
$ weight : num 19.1 21.7 24.7 19.6 20.6 20.2 28 21.6 26.1 22.6 ...
Upvotes: 7