Reputation: 1062
Sorry for beginner questions.
I have a data frame (I think; please correct me if I'm wrong here):
data <- read.csv("adult.data", sep = ",", header = FALSE)
The data is from https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
When a value is missing, the file just has "?" in its place. I need to count how many values are missing in each column.
I can count instances of a number, but not strings.
Col 1 is age, so I can do this:
length(which(data[,1] == 55))
And it will tell me how many people were 55 in this dataset.
But if I try
length(which(data[,2] == "?"))
It says 0.
How do I compare strings in R?
Upvotes: 1
Views: 3286
Reputation: 898
Those answerers above were sharp enough to spot the problem "by-eye". I took the pedestrian route:
unique(grep("\\?", df[,2], value = TRUE))
That showed me the problem was a space before each of the question marks. Not remembering the na.strings and strip.white options (thanks for the reminder!), I just ran:
colSums(df == " ?")
Now that I see it, reading the data correctly in the first place is obviously the better way. I only add this to show one way I use to hunt for string data problems when my "eyeball technique" fails me.
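For anyone who wants to see the effect without downloading the file, here is a minimal sketch on a made-up data frame. The column names and values are invented for illustration; only the leading space before each string mirrors what I found in the real data. It also uses base R's trimws(), which I did not use above, as one possible way to compare while ignoring the stray space.

```r
# Toy stand-in for the real file (hypothetical values). The strings keep
# the leading space that follows each comma in adult.data.
toy <- data.frame(
  age       = c(39, 50, 38, 53),
  workclass = c(" State-gov", " ?", " Private", " ?"),
  stringsAsFactors = FALSE
)

# Exact comparison finds nothing, because the stored value is " ?", not "?".
n_exact <- sum(toy$workclass == "?")

# List the distinct values that contain a literal question mark;
# print() quotes the strings, which makes the leading space visible.
print(unique(grep("?", toy$workclass, fixed = TRUE, value = TRUE)))

# Trim surrounding whitespace before comparing.
n_missing <- sum(trimws(toy$workclass) == "?")

n_exact    # 0
n_missing  # 2
```

Using fixed = TRUE in grep() avoids having to escape the question mark as a regular expression.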
Upvotes: 2
Reputation: 99321
It looks like if you read it in again with na.strings = "?" and strip.white = TRUE, you'll get proper NA values and be able to use is.na()
df <- read.csv(
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
header = FALSE,
na.strings = "?",
strip.white = TRUE
)
## total NA in the data
sum(is.na(df))
# [1] 4262
## total NA for column 2
sum(is.na(df[[2]]))
# [1] 1836
## count NA by column
colSums(is.na(df))
#   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14  V15
#    0 1836    0    0    0    0 1843    0    0    0    0    0    0  583    0
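If you'd like to check how the two arguments interact without hitting the network, the following is a small sketch that reads three made-up rows from a string via the text argument. The rows are invented and only imitate the "comma followed by a space" layout of the real file. As far as I understand the documentation, the na.strings test happens after white space is stripped, which is why a plain "?" works here.

```r
# Three invented rows in the same layout as adult.data (not real records).
raw <- "39, State-gov, Bachelors
50, ?, Bachelors
38, Private, ?"

toy <- read.csv(
  text = raw,
  header = FALSE,
  na.strings = "?",        # treat "?" as NA ...
  strip.white = TRUE,      # ... after leading/trailing spaces are removed
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_per_col <- colSums(is.na(toy))
na_per_col
```

With strip.white = FALSE the fields would still be " ?", so they would not match na.strings = "?" and no NA values would appear.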
Upvotes: 5