mtfurlan

Reputation: 1062

R count number of specific string in data frame

Sorry for beginner questions.

I have a data frame (I think; please correct me if I'm wrong here).

data <- read.csv("adult.data", sep=',', header=F)

Data is https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

When data is missing, it just has "?" instead of data. I need to count how much data is missing in each column.

I can count instances of a number, but not strings.

Col 1 is age, so I can do this:

length(which(data[,1] == 55))

And it will tell me how many people were 55 in this dataset.

But if I try

length(which(data[,2] == "?"))

It says 0.

How do I compare strings in R?

Upvotes: 1

Views: 3286

Answers (2)

Rick

Reputation: 898

Those answerers above were sharp enough to spot the problem "by-eye". I took the pedestrian route:

unique(grep("\\?", df[,2], value = TRUE))

That showed me the problem was a space before each of the question marks. Not remembering the na.strings and strip.white options (thanks for the reminder!), I just ran:

colSums(df == " ?")

Now that I see it, reading the data correctly in the first place is obviously the better way. I only add this to show one way I use to hunt for string data problems when my "eyeball technique" fails me.
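
For anyone who wants to try the same hunt without downloading the file, here is a minimal sketch on a made-up data frame. The name toy and its values are hypothetical stand-ins for the real data; the idea is just that trimming whitespace with trimws() before comparing makes the "?" match:

```r
# Toy data frame standing in for the adult data (values are made up).
toy <- data.frame(
  V1 = c(39, 50, 38),
  V2 = c(" State-gov", " ?", " ?"),
  stringsAsFactors = FALSE
)

# The exact comparison finds nothing because of the leading space.
sum(toy$V2 == "?")

# Trimming whitespace first makes the comparison match.
sum(trimws(toy$V2) == "?")

# Or trim every character column once, then count "?" per column.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], trimws)
missing_per_col <- colSums(toy == "?")
missing_per_col
```

On the real data the same pattern should apply to df, but I have only checked it on the toy example here.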

Upvotes: 2

Rich Scriven

Reputation: 99321

It looks like if you read it in again with na.strings = "?" and strip.white = TRUE, you'll get proper NA values and be able to use is.na():

df <- read.csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", 
    header = FALSE, 
    na.strings = "?", 
    strip.white = TRUE
)

## total NA in the data
sum(is.na(df))
# [1] 4262

## total NA for column 2
sum(is.na(df[[2]]))
# [1] 1836

## count NA by column
colSums(is.na(df))
#   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14  V15
#    0 1836    0    0    0    0 1843    0    0    0    0    0    0  583    0
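
If you want to see the two arguments working without hitting the network, here is a small sketch that uses the text argument of read.csv() on three made-up rows. The string raw and the name demo are assumptions for illustration; they only mimic the file's ", " separators and " ?" placeholders:

```r
# Three made-up rows in the same shape as the file:
# comma-plus-space separators and " ?" where a value is missing.
raw <- "39, State-gov, 77516\n50, ?, 83311\n38, Private, ?"

demo <- read.csv(
    text = raw,
    header = FALSE,
    na.strings = "?",     # treat "?" as NA
    strip.white = TRUE    # strip the leading space so " ?" is seen as "?"
)

## count NA by column
na_counts <- colSums(is.na(demo))
na_counts
```

The whitespace is stripped before the na.strings test is applied, which is why plain "?" is enough here even though the raw fields are " ?".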

Upvotes: 5
