Reputation: 175
I have two sample data frames, df1
and df2
as given below.
df1
has the list of selected tennis match fixtures with player names(player1_name
,player_name2
) and the date they were played. Full names are used here for players.
df2
has the list of all tennis match results(winner
, loser
) for each date. Here, the first letter of first names and full last names are used.
The player names for fixtures and for results were scraped from different websites. So there could be some cases where last names may not exactly match.
Taking this into consideration, I would like to add a new column to df1
that says if player1 or player2 won. Basically, I would want to map player1_name
and player2_name
from df1
to winner and loser from df2 by some means of partial matching given the same date.
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere",
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen",
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson",
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas",
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")
dput(df2)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin",
"A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis",
"M Giron", "A Rublev", "N Djokovic", "R Carballes Baena",
"A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov",
"G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin",
"V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot",
"G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey",
"M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili",
"J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
)), row.names = c(NA, -20L), class = "data.frame")
I have created a custom function that can match a string to it’s closest match from a string vector using RecordLinkage package. I could possibly write a super inefficient code using this function but before going there, I’d want to see if I can do it in a more efficient manner.
ClosestMatch <- function(string, stringVector,max_threshold=0.5) {
df<- character()
for (i in 1:length(string)) {
distance <- levenshteinSim(string[i], stringVector)
if (max(distance)>=max_threshold) {
df[i]<- stringVector[which.max(distance)]
}
else {
df[i]= NA
}
}
return(df)
}
Upvotes: 1
Views: 158
Reputation: 500
I gave it a go using stringdist
:
library(stringdist)
for (i in 1:nrow(df1)) {
#this first part combines the names of player1 and player2
#and finds the closest match to the player combinations in df2
d <-
stringdist(
paste(df1$player1_name[i], df1$player2_name[i]),
paste(df2$winner, df2$loser),
method = "cosine")
#I like using the cosine method as it returns a decimal as opposed to an integer
#then, added winner and loser columns to df1 based on which row in df2 had the closest match
#(i.e. lowest stringdist)
df1$winner[i] <- df2[which(d == min(d)), 2]
df1$loser[i] <- df2[which(d == min(d)), 3]
}
#adding another loop that makes the names in the winner/loser columns
#change to their closest match in the player1 and player2 columns
for(i in 1:nrow(df1)){
n <- stringdist(df1$winner[i], c(df1$player1_name[i], df1$player2_name[i]), method = "cosine")
if (n[1] > n[2]){df1$winner[i] <- df1$player2_name[i]
df1$loser[i] <- df1$player1_name[i]}
if (n[1] < n[2]){df1$winner[i] <- df1$player1_name[i]
df1$loser[i] <- df1$player2_name[i]}
}
> df1
date player1_name player2_name winner loser
1 2020-09-29 Laslo Djere Kevin Anderson Kevin Anderson Laslo Djere
2 2020-09-29 Hugo Dellien Ricardas Berankis Ricardas Berankis Hugo Dellien
3 2020-09-29 Quentin Halys Marcos Giron Marcos Giron Quentin Halys
4 2020-09-29 Steve Johnson Roberto Carballes Roberto Carballes Steve Johnson
5 2020-09-29 Henri Laaksonen Pablo Cuevas Pablo Cuevas Henri Laaksonen
6 2020-09-29 Thiago Monteiro Nikoloz Basilashvili Thiago Monteiro Nikoloz Basilashvili
7 2020-09-29 Andrej Martin Joao Sousa Andrej Martin Joao Sousa
Upvotes: 2