NunNo
NunNo

Reputation: 35

Custom function to make duplicated values distinct based on a condition in R

Dataset can be downloaded from here

library(dplyr)
NBA <- read.csv("NBA Season Dataset/Seasons_Stats.csv")
NBA$Player <- as.character(NBA$Player)
PlayerData <- read.csv("NBA Season Dataset/player_data.csv")
PlayerData$name <- as.character(PlayerData$name)

I want to take their height & weight from PlayerData then combine with the main data NBA. The problem is this NBA players dataset contains some players share the same name with other players, so I need to make their names distinct before I merge two data frame together with merge by player's name.

PlayerData[duplicated(PlayerData$name), "name"] give me 50 duplicated names.

So I created a function which will rename player in both data frame based on the years they active:

unduplicate <- function(name, year_start, year_end, new_name) { 
    PlayerData[PlayerData$name == name & PlayerData$year_start == year_start & PlayerData$year_end == year_end, 1] = new_name
    NBA[NBA$Player == name & NBA$Year <= year_end & NBA$Year >= year_start, "Player"] = new_name
}

For an example, I take these two players who share the same name. DeeBrown1

And then call the function:

unduplicate("Dee Brown", 1991, 2002, "Dee Brown 1")
unduplicate("Dee Brown", 2007, 2009, "Dee Brown 2")

Nothing changed...

But if I do manually like this:

PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 1991 & PlayerData$year_end == 2002, 1] = "Dee Brown 1"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2002 & NBA$Year >= 1991, "Player"] = "Dee Brown 1"

PlayerData[PlayerData$name == "Dee Brown" & PlayerData$year_start == 2007 & PlayerData$year_end == 2009, 1] = "Dee Brown 2"
NBA[NBA$Player == "Dee Brown" & NBA$Year <= 2009 & NBA$Year >= 2007, "Player"] = "Dee Brown 2"

Then come to the desired result: DeeBrown2

So my question is,

1) What's wrong with the function? I checked and tried many variations didn't work.

2) Is there any way better to solve this problem?

I'm kinda new to this, so forgive me if this just stupid beginner's mistake.

Thanks!

Upvotes: 0

Views: 45

Answers (1)

Nar
Nar

Reputation: 658

You can use distinct from dplyr to select unique players based on set of variables. Sqldf library gives possibility to merge tables based on condition with inequality:

library(dplyr)                 
player_data <- read.csv("player_data.csv", stringsAsFactors = F)
Players <- read.csv("Players.csv", stringsAsFactors = F)
NBA1<-  read.csv("Seasons_Stats.csv", stringsAsFactors = F)

Dist_players <-player_data%>%
  distinct(name, year_start, year_end, height,  weight )

library(sqldf)
Final <- sqldf("SELECT * FROM NBA1 JOIN Dist_players ON NBA1.Player = Dist_players.name 
      WHERE NBA1.Year >= Dist_players.year_start AND NBA1.Year <= Dist_players.year_end")

Upvotes: 1

Related Questions