Reputation: 3
I am sorry for the unclear title, but I am not exactly sure how to best explain the issue in words and will instead try using an example. I am working with a basketball dataset where certain rows in the position column have NAs. I would like to update the position column with the position with the closest average height in that position for that year. Here is an example dataframe:
df_player <- data.frame(id = 1:100,
year = floor(runif(100,2000,2006)),
height = runif(100,70,85),
pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))
I have also created a dataframe of average heights for each position and each year for ease of explaining my ideal solution. I do not intend to form this dataframe in a solution I am just providing it to help explain what a solution would need to do.
df_avg <- df_player%>%
filter(!is.na(pos))%>%
group_by(year, pos)%>%
summarize(avg_height = mean(height))
For each player with a missing position in df_player, I would first like to match the missing player's year to the year in df_avg. Then, compare the player's height to the average heights for each position in that year. Using these comparisons, I could then fill the NA with the position that corresponds to the average position closest to the height. I would prefer to not do this with joins.
An example in words:
-A player was drafted in the year 2003 but is missing position data. The player's height is 73.
-The average heights for guards, forwards, and centers ("G","F","C") in 2003 were 70, 72, and 76 respectively.
-The position for the player in 2003 would be updated to forward ("F").
I have solved a similar issue with numeric data using tidyverse group by and mutate. If height were missing a solution is illustrated below.
df_player%>%
group_by(year)%>%
mutate(height = case_when(is.na(height)~median(height, na.rm = TRUE),TRUE~height))
I am hoping to find a tidyverse solution similar to this that I could use within a pipe but any assistance that you can provide is much appreciated.
Upvotes: 0
Views: 161
Reputation: 388807
Here is one way using dplyr
and series of joins :
library(dplyr)
df_player %>%
filter(is.na(pos)) %>%
left_join(df_avg, by = 'year') %>%
group_by(id) %>%
mutate(pos.x = pos.y[which.min(abs(height - avg_height))]) %>%
filter(!duplicated(id)) %>%
right_join(df_player) %>%
mutate(pos = coalesce(pos, pos.x)) %>%
select(-pos.x, -pos.y, -avg_height)
# id year height pos
# <int> <dbl> <dbl> <fct>
# 1 1 2001 74.9 F
# 2 2 2001 75.8 F
# 3 3 2003 70.6 G
# 4 4 2000 75.4 C
# 5 5 2002 78.6 F
# 6 6 2002 80.3 G
# 7 7 2004 84.6 C
# 8 8 2002 80.5 F
# 9 9 2003 70.2 C
#10 10 2001 78.0 F
# … with 90 more rows
data
set.seed(100)
df_player <- data.frame(id = 1:100,
year = floor(runif(100,2000,2006)),
height = runif(100,70,85),
pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))
Upvotes: 1
Reputation: 2959
The lapply
function filters the average height data frame by year and finds the position with the minimum absolute difference between the players height and the average. If the position is missing then it is updated with closest position from y
.
library(dplyr)
df_avg <- mutate(df_avg, pos = as.character(pos))
df_player <- df_player %>%
as_tibble() %>%
mutate(id = 1:nrow(df_player),
pos = as.character(pos)) %>%
split(.$id) %>%
lapply(function(x, ref) {
y <- ref %>%
as_tibble() %>%
filter(year == x$year) %>%
mutate(diff = abs(ref[ref$year == x$year, ]$avg_height - as.numeric(x$height))) %>%
top_n(1, desc(diff))
mutate(x, pos = ifelse(is.na(pos), y$pos, pos))
}, ref = df_avg) %>%
bind_rows() %>%
select(-id)
This calculates and applies the means within lapply
.
library(dplyr)
df_player <- tibble(id = 1:100,
year = floor(runif(100,2000,2006)),
height = runif(100,70,85),
pos = sample(c("G","F","C",NA), size = 100, replace = TRUE))
df_player %>%
mutate(id = 1:nrow(df_player)) %>%
split(.$id) %>%
lapply(function(x, ref) {
y <- ref %>%
filter(year == x$year,
!is.na(pos)) %>%
group_by(pos) %>%
summarise(avg_height = mean(height, na.rm = TRUE)) %>%
mutate(diff = abs(avg_height - as.numeric(x$height))) %>%
top_n(1, desc(diff))
mutate(x, pos = ifelse(is.na(pos), y$pos, pos))
}, ref = df_player) %>%
bind_rows() %>%
select(-id)
Upvotes: 1