Rice Legend

Reputation: 63

Levels of factor still show after deleting the elements

I'm reading in a data set, and after deleting the rows containing " ?" (or NAs, as you would call them), the " ?" level still appears when I type:

levels(Sample$occupation)
[1] " ?" " Adm-clerical" " Armed-Forces" " Craft-repair"
[5] " Exec-managerial" " Farming-fishing" " Handlers-cleaners" " Machine-op-inspct"
[9] " Other-service" " Priv-house-serv" " Prof-specialty" " Protective-serv"
[13] " Sales" " Tech-support" " Transport-moving"

The level also shows up when I use the str function. But when I use nrow or subset(Sample, occupation==" ?"), the rows appear to have been deleted. Is there an explanation for this? The full data set can be found at http://archive.ics.uci.edu/ml/datasets/Adult. I have another version, but I think it is this one. :)

#Loading data set
        mappesti <- paste0(file_content,"\\2. cand.merc.(mat)\\6. Data Science\\Reidar\\")

        data <- read.table(paste0(mappesti,"adult.txt"),header=FALSE,sep=",")

#Naming data set
        colnames(data) <- c("age",
        "workclass",
        "fnlwgt",
        "education",
        "education.num",
        "marital.status",
        "occupation",
        "relationship",
        "race",
        "sex",
        "capital.gain",
        "capital.loss",
        "hours.per.week",
        "native.country",
        "class")


#Counting rows with " ?" in each column
        length(data$occupation[data$occupation==" ?"])
        length(data$native.country[data$native.country==" ?"])
        length(data$workclass[data$workclass==" ?"])

#Deleting rows with " ?"
        Sample <- data
        str(Sample)
        subset(Sample, occupation==" ?")
        Sample <- subset(Sample, occupation!=" ?")
        Sample <- subset(Sample, native.country!=" ?")
        Sample <- subset(Sample, workclass!=" ?")
        subset(Sample, occupation==" ?")

        nrow(Sample)
        levels(Sample$occupation)

Upvotes: 2

Views: 308

Answers (1)

G5W

Reputation: 37641

Yes, factors can have levels even if there are no points with that value.

F1 = factor(c("red", "blue", "red"), levels=c("red", "blue", "green"))
table(F1)
F1
  red  blue green 
    2     1     0 
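
A minimal sketch of what the unused level buys you, continuing the F1 example (the name F3 and the value "purple" are made up for illustration): a value that is a declared level can be assigned later, while a value outside the levels is stored as NA with a warning.

```r
# Same factor as above: "green" is declared as a level but not yet used
F1 <- factor(c("red", "blue", "red"), levels = c("red", "blue", "green"))

# "green" is a declared level, so this assignment works
F1[2] <- "green"
table(F1)

# "purple" is NOT a declared level (hypothetical value for illustration):
# R stores NA in that position and issues an "invalid factor level" warning
F3 <- F1
F3[1] <- "purple"
is.na(F3[1])
```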

This is the desired behavior. Just because I do not have any green points now does not mean that I won't have any later. If there were no level for green, I couldn't simply add a green point. However, as @A5C1D2H2I1M1N2O1R2T1 commented, you can drop all levels that are not in use with droplevels.

F2 = droplevels(F1)
table(F2)
F2
 red blue 
   2    1 
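
The same idea should carry over to a data frame like the one in the question. The sketch below uses a small made-up stand-in for the Adult data (the rows are invented, only the column name occupation and the " ?" marker come from the question), and it assumes occupation is a factor column, as the levels() output in the question implies.

```r
# Toy stand-in for the Adult data; the values are made up for illustration
Sample <- data.frame(
  age = c(39, 50, 38, 53),
  occupation = factor(c(" Adm-clerical", " ?", " Sales", " ?"))
)

# Remove the " ?" rows: the rows are gone, but the " ?" level is still declared
Sample <- subset(Sample, occupation != " ?")
nrow(Sample)
levels(Sample$occupation)

# droplevels() on a data frame drops unused levels from every factor column
Sample <- droplevels(Sample)
levels(Sample$occupation)
```

If only one column needs cleaning, re-creating the factor with Sample$occupation <- factor(Sample$occupation) has the same effect for that column.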

Upvotes: 3
