Reputation: 747
Does anyone have any idea what could be going on here? I'm trying to do imputation on NA values but I'm getting nowhere. Here is my dataframe. I'm including the whole thing only because I thought maybe it would be helpful to have the full thing instead of just the first n rows:
structure(list(INDEX = 1:6, TARGET_WINS = c(39L, 70L, 86L, 70L,
82L, 75L), TEAM_BATTING_H = c(1445L, 1339L, 1377L, 1387L, 1297L,
1279L), TEAM_BATTING_2B = c(194L, 219L, 232L, 209L, 186L, 200L
), TEAM_BATTING_3B = c(39L, 22L, 35L, 38L, 27L, 36L), TEAM_BATTING_HR = c(13L,
190L, 137L, 96L, 102L, 92L), TEAM_BATTING_BB = c(143L, 685L,
602L, 451L, 472L, 443L), TEAM_BATTING_SO = c(842, 1075, 917,
922, 920, 973), TEAM_BASERUN_SB = c(NA, 37L, 46L, 43L, 49L, 107L
), TEAM_BASERUN_CS = c(NA, 28L, 27L, 30L, 39L, 59L), TEAM_BATTING_HBP = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), TEAM_PITCHING_H = c(9364L, 1347L, 1377L, 1396L, 1297L, 1279L
), TEAM_PITCHING_HR = c(84L, 191L, 137L, 97L, 102L, 92L), TEAM_PITCHING_BB = c(927L,
689L, 602L, 454L, 472L, 443L), TEAM_PITCHING_SO = c(5456L, 1082L,
917L, 928L, 920L, 973L), TEAM_FIELDING_E = c(1011L, 193L, 175L,
164L, 138L, 123L), TEAM_FIELDING_DP = c(NA, 155L, 153L, 156L,
168L, 149L)), row.names = c(NA, 6L), class = "data.frame")
I check whether there are any NA values:
any(is.na(moneyball_training_data)) # TRUE
I find where these NA values are:
moneyball_training_data %>% summarise(across(everything(), ~ any(is.na(.x))))
I look at the class of one of the variables that has NA values:
class(moneyball_training_data$TEAM_BATTING_SO) # numeric
I try to impute it with the median value of that vector:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO)
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But I still get TRUE when I ask if there are NA values...
But maybe I forgot to remove NAs in the call to median, so I try again with na.rm = TRUE:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But that doesn't work. So I find the median value another way and then use that value for the imputation:
median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE) # 750
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == NA] <- 750
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But this doesn't impute 750 for the NA values either. Maybe I should compare against "" instead of NA:
moneyball_training_data$TEAM_BATTING_SO[moneyball_training_data$TEAM_BATTING_SO == ""] <- 750
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # TRUE
But this doesn't work either. Does anyone know why this imputation isn't working?
Upvotes: 0
Views: 44
Reputation: 1816
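When creating the logical vector for subsetting, you should use is.na(), which you already used correctly before and after the assignment. Comparing anything to NA with == returns NA rather than TRUE or FALSE, so the index built from == NA never selects any element and nothing gets replaced.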
moneyball_training_data$TEAM_BATTING_SO[is.na(moneyball_training_data$TEAM_BATTING_SO)] <- median(moneyball_training_data$TEAM_BATTING_SO, na.rm = TRUE)
any(is.na(moneyball_training_data$TEAM_BATTING_SO)) # FALSE
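To see the difference in isolation, here is a minimal sketch on a small made-up vector (the values in x are hypothetical, not taken from your data). It contrasts the == NA index, which replaces nothing, with the is.na() index:

```r
# Toy vector standing in for a column with missing values (hypothetical data).
x <- c(842, NA, 917, 922, NA, 973)

# Comparing with NA yields NA for every element, never TRUE.
cmp <- x == NA
all(is.na(cmp))   # TRUE

# An all-NA logical index selects nothing for assignment, so y is unchanged.
y <- x
y[y == NA] <- median(y, na.rm = TRUE)
anyNA(y)          # TRUE

# is.na() returns a proper TRUE/FALSE vector, so the assignment takes effect.
z <- x
z[is.na(z)] <- median(z, na.rm = TRUE)
anyNA(z)          # FALSE
```

One caveat: a column that is entirely NA (TEAM_BATTING_HBP in the rows you posted) has an NA median even with na.rm = TRUE, so this approach would leave it unchanged.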
Upvotes: 1