Denis P
Denis P

Reputation: 3

Clean dataframe column with null values by substituting a factor based on summary conditions

I am sorry for the unclear title, but I am not exactly sure how to best explain the issue in words and will instead try using an example. I am working with a basketball dataset where certain rows in the position column have NAs. I would like to update the position column with the position with the closest average height in that position for that year. Here is an example dataframe:

df_player <- data.frame(id = 1:100, 
                        year = floor(runif(100,2000,2006)), 
                        height = runif(100,70,85), 
                        pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))

I have also created a dataframe of average heights for each position and each year for ease of explaining my ideal solution. I do not intend to form this dataframe in a solution I am just providing it to help explain what a solution would need to do.

df_avg <- df_player%>%
  filter(!is.na(pos))%>%
  group_by(year, pos)%>%
  summarize(avg_height = mean(height))

For each player with a missing position in df_player, I would first like to match the missing player's year to the year in df_avg. Then, compare the player's height to the average heights for each position in that year. Using these comparisons, I could then fill the NA with the position that corresponds to the average position closest to the height. I would prefer to not do this with joins. An example in words:
-A player was drafted in the year 2003 but is missing position data. The player's height is 73.
-The average heights for guards, forwards, and centers ("G","F","C") in 2003 were 70, 72, and 76 respectively.
-The position for the player in 2003 would be updated to forward ("F").

I have solved a similar issue with numeric data using tidyverse group by and mutate. If height were missing a solution is illustrated below.

df_player%>%
group_by(year)%>%
  mutate(height = case_when(is.na(height)~median(height, na.rm = TRUE),TRUE~height))

I am hoping to find a tidyverse solution similar to this that I could use within a pipe but any assistance that you can provide is much appreciated.

Upvotes: 0

Views: 161

Answers (2)

Ronak Shah
Ronak Shah

Reputation: 388807

Here is one way using dplyr and series of joins :

library(dplyr)

df_player %>%
  filter(is.na(pos)) %>%
  left_join(df_avg, by = 'year') %>%
  group_by(id) %>%
  mutate(pos.x = pos.y[which.min(abs(height - avg_height))]) %>%
  filter(!duplicated(id)) %>%
  right_join(df_player) %>%
  mutate(pos = coalesce(pos, pos.x)) %>%
  select(-pos.x, -pos.y, -avg_height)


#      id  year height pos  
#   <int> <dbl>  <dbl> <fct>
# 1     1  2001   74.9 F    
# 2     2  2001   75.8 F    
# 3     3  2003   70.6 G    
# 4     4  2000   75.4 C    
# 5     5  2002   78.6 F    
# 6     6  2002   80.3 G    
# 7     7  2004   84.6 C    
# 8     8  2002   80.5 F    
# 9     9  2003   70.2 C    
#10    10  2001   78.0 F    
# … with 90 more rows

data

set.seed(100)
df_player <- data.frame(id = 1:100, 
                        year = floor(runif(100,2000,2006)), 
                        height = runif(100,70,85), 
                        pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))

Upvotes: 1

Paul
Paul

Reputation: 2959

The lapply function filters the average height data frame by year and finds the position with the minimum absolute difference between the players height and the average. If the position is missing then it is updated with closest position from y.

library(dplyr)

df_avg <- mutate(df_avg, pos = as.character(pos))

df_player <- df_player %>%
  as_tibble() %>%
  mutate(id = 1:nrow(df_player),
         pos = as.character(pos)) %>%
  split(.$id) %>%
  lapply(function(x, ref) {

    y <- ref %>%
      as_tibble() %>%
      filter(year == x$year) %>%
      mutate(diff = abs(ref[ref$year == x$year, ]$avg_height - as.numeric(x$height))) %>%
      top_n(1, desc(diff))

    mutate(x, pos = ifelse(is.na(pos), y$pos, pos))

  }, ref = df_avg) %>%
  bind_rows() %>%
  select(-id)

Update

This calculates and applies the means within lapply.

library(dplyr)

df_player <- tibble(id = 1:100, 
                        year = floor(runif(100,2000,2006)), 
                        height = runif(100,70,85), 
                        pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))

df_player %>%
  mutate(id = 1:nrow(df_player)) %>%
  split(.$id) %>%
  lapply(function(x, ref) {

    y <- ref %>%
      filter(year == x$year,
             !is.na(pos)) %>%
      group_by(pos) %>%
      summarise(avg_height = mean(height, na.rm = TRUE)) %>%
      mutate(diff = abs(avg_height - as.numeric(x$height))) %>%
      top_n(1, desc(diff))

    mutate(x, pos = ifelse(is.na(pos), y$pos, pos))

  }, ref = df_player) %>%
  bind_rows() %>%
  select(-id)

Upvotes: 1

Related Questions