Custom function to make duplicated values distinct based on a condition in R

Question

Dataset can be downloaded from here

library(dplyr)
NBA <- read.csv("NBA Season Dataset/Seasons_Stats.csv")
NBA$Player <- as.character(NBA$Player)
PlayerData <- read.csv("NBA Season Dataset/player_data.csv")
PlayerData$name <- as.character(PlayerData$name)

I want to take their height & weight from PlayerData then combine with the main data NBA. The problem is this NBA players dataset contains some players share the same name with other players, so I need to make their names distinct before I merge two data frame together with merge by player's name.

PlayerData[duplicated(PlayerData$name), "name"] give me 50 duplicated names.

So I created a function which will rename player in both data frame based on the years they active:

unduplicate <- function(name, year_start, year_end, new_name) { 
    PlayerData[PlayerData$name == name & PlayerData$year_start == year_start & PlayerData$year_end == year_end, 1] = new_name
    NBA[NBA$Player == name & NBA$Year <= year_end & NBA$Year >= year_start, "Player"] = new_name
}

For an example, I take these two players who share the same name.

And then call the function:

unduplicate("Dee Brown", 1991, 2002, "Dee Brown 1")
unduplicate("Dee Brown", 2007, 2009, "Dee Brown 2")

Nothing changed...

But if I do manually like this:

PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 1991 & PlayerData$year_end == 2002, 1] = "Dee Brown 1"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2002 & NBA$Year >= 1991, "Player"] = "Dee Brown 1"

PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 2007 & PlayerData$year_end == 2009, 1] = "Dee Brown 2"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2009 & NBA$Year >= 2007, "Player"] = "Dee Brown 2"

Then come to the desired result:

So my question is,

1) What's wrong with the function? I checked and tried many variations didn't work.

2) Is there any way better to solve this problem?

I'm kinda new to this, so forgive me if this just stupid beginner's mistake.

Thanks!

Nar · Accepted Answer

You can use distinct from dplyr to select unique players based on set of variables. Sqldf library gives possibility to merge tables based on condition with inequality:

library(dplyr)                 
player_data <- read.csv("player_data.csv", stringsAsFactors = F)
Players <- read.csv("Players.csv", stringsAsFactors = F)
NBA1<-  read.csv("Seasons_Stats.csv", stringsAsFactors = F)

Dist_players <-player_data%>%
  distinct(name, year_start, year_end, height,  weight )

library(sqldf)
Final <- sqldf("SELECT * FROM NBA1 JOIN Dist_players ON NBA1.Player = Dist_players.name 
      WHERE NBA1.Year >= Dist_players.year_start AND NBA1.Year <= Dist_players.year_end")

Custom function to make duplicated values distinct based on a condition in R

Answers (1)

Related Questions