hachiko

Reputation: 747

R: nothing changes after median imputation

Does anyone have any idea what could be going on here? I'm trying to do imputation on NA values but I'm getting nowhere. Here is my dataframe. I'm including the whole thing only because I thought maybe it would be helpful to have the full thing instead of just the first n rows:

structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L, 
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L, 
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L, 
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L, 
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842, 1075, 917, 
922, 920, 973), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 107L
), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_, 
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L, 
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L, 
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L, 
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L, 
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")

I check whether there are any NA values:

any(is.na(moneyball_training_data)) # TRUE

I find where these NA values are:

moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))

I look at the class of one of the variables that has NA values:

class(moneyball_training_data$TEAM_BATTING_SO) # numeric
  

I try to impute it with the median value of that vector:

moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO)

any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE

But I still get TRUE when I ask if there are NA values...

Maybe I forgot to remove the NA values in the call to median, so I try again with na.rm = TRUE:

moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)

any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE

But that doesn't work. So I find the median value another way and then use that value for the imputation:

median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE) # 750

moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- 750

any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE

But this doesn't impute 750 for the NA values. But maybe I should just use "" instead of NA:

moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == ""] <- 750

any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE

But this doesn't work either. Does anyone know why this imputation isn't working?

Upvotes: 0

Views: 44

Answers (1)

fabla

Reputation: 1816

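When building the logical vector for subsetting, use is.na(), which you already used correctly before and after the assignment. A comparison such as x == NA evaluates to NA for every element, never TRUE, so the subset selects nothing and no value is replaced.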

moneyball_training_data$TEAM_BATTING_SO[is.na(moneyball_training_data$TEAM_BATTING_SO)] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)

any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # FALSE
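
To see the difference in isolation, here is a minimal sketch on a small made-up vector. The names so and so_imputed and the values are hypothetical stand-ins for TEAM_BATTING_SO, not taken from the real data. It only illustrates how == NA and is.na() behave as subsetting conditions.

```r
# Hypothetical toy vector standing in for TEAM_BATTING_SO (values are made up)
so <- c(842, NA, 917, 922, NA, 973)

# Comparing against NA with == gives NA for every element, never TRUE,
# so it cannot mark which entries are missing
so == NA
# [1] NA NA NA NA NA NA

# is.na() returns a proper TRUE/FALSE vector marking the missing entries
is.na(so)
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

# Impute the median of the observed values into the missing positions
so_imputed <- so
so_imputed[is.na(so_imputed)] <- median(so_imputed, na.rm = TRUE)

# No missing values remain
any(is.na(so_imputed))
# [1] FALSE
```

The same pattern should work for the other columns that contain NA values, although a column that is entirely NA (TEAM_BATTING_HBP in the rows shown) has no observed values to take a median from.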

Upvotes: 1
