Clean dataframe column with null values by substituting a factor based on summary conditions

Question

I am sorry for the unclear title, but I am not exactly sure how to best explain the issue in words and will instead try using an example. I am working with a basketball dataset where certain rows in the position column have NAs. I would like to update the position column with the position with the closest average height in that position for that year. Here is an example dataframe:

df_player <- data.frame(id = 1:100, 
                        year = floor(runif(100,2000,2006)), 
                        height = runif(100,70,85), 
                        pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))

I have also created a dataframe of average heights for each position and each year for ease of explaining my ideal solution. I do not intend to form this dataframe in a solution I am just providing it to help explain what a solution would need to do.

df_avg <- df_player%>%
  filter(!is.na(pos))%>%
  group_by(year, pos)%>%
  summarize(avg_height = mean(height))

For each player with a missing position in df_player, I would first like to match the missing player's year to the year in df_avg. Then, compare the player's height to the average heights for each position in that year. Using these comparisons, I could then fill the NA with the position that corresponds to the average position closest to the height. I would prefer to not do this with joins. An example in words:
-A player was drafted in the year 2003 but is missing position data. The player's height is 73.
-The average heights for guards, forwards, and centers ("G","F","C") in 2003 were 70, 72, and 76 respectively.
-The position for the player in 2003 would be updated to forward ("F").

I have solved a similar issue with numeric data using tidyverse group by and mutate. If height were missing a solution is illustrated below.

df_player%>%
group_by(year)%>%
  mutate(height = case_when(is.na(height)~median(height, na.rm = TRUE),TRUE~height))

I am hoping to find a tidyverse solution similar to this that I could use within a pipe but any assistance that you can provide is much appreciated.

Paul · Accepted Answer

The lapply function filters the average height data frame by year and finds the position with the minimum absolute difference between the players height and the average. If the position is missing then it is updated with closest position from y.

library(dplyr)

df_avg <- mutate(df_avg, pos = as.character(pos))

df_player <- df_player %>%
  as_tibble() %>%
  mutate(id = 1:nrow(df_player),
         pos = as.character(pos)) %>%
  split(.$id) %>%
  lapply(function(x, ref) {

    y <- ref %>%
      as_tibble() %>%
      filter(year == x$year) %>%
      mutate(diff = abs(ref[ref$year == x$year, ]$avg_height - as.numeric(x$height))) %>%
      top_n(1, desc(diff))

    mutate(x, pos = ifelse(is.na(pos), y$pos, pos))

  }, ref = df_avg) %>%
  bind_rows() %>%
  select(-id)

Update

This calculates and applies the means within lapply.

library(dplyr)

df_player <- tibble(id = 1:100, 
                        year = floor(runif(100,2000,2006)), 
                        height = runif(100,70,85), 
                        pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))

df_player %>%
  mutate(id = 1:nrow(df_player)) %>%
  split(.$id) %>%
  lapply(function(x, ref) {

    y <- ref %>%
      filter(year == x$year,
             !is.na(pos)) %>%
      group_by(pos) %>%
      summarise(avg_height = mean(height, na.rm = TRUE)) %>%
      mutate(diff = abs(avg_height - as.numeric(x$height))) %>%
      top_n(1, desc(diff))

    mutate(x, pos = ifelse(is.na(pos), y$pos, pos))

  }, ref = df_player) %>%
  bind_rows() %>%
  select(-id)

Clean dataframe column with null values by substituting a factor based on summary conditions

Answers (2)

Update

Related Questions