Reputation: 10163
I have a dataframe where each row in the data represents a matchup in a soccer game. Here is a summary with some columns removed, and only for 50 games of a season:
dput(mydata)
structure(list(home_id = c(75L, 323L, 607L, 3627L, 3645L, 641L,
204L, 111L, 287L, 179L, 1062L, 292L, 413L, 275L, 182L, 3639L,
179L, 2649L, 111L, 478L, 383L, 3645L, 275L, 577L, 3639L, 75L,
413L, 287L, 607L, 3627L, 1062L, 75L, 583L, 323L, 3736L, 577L,
179L, 287L, 275L, 3645L, 3639L, 583L, 179L, 413L, 641L, 204L,
478L, 292L, 607L, 323L), away_id = c(3645L, 3736L, 583L, 2649L,
577L, 75L, 3736L, 182L, 323L, 607L, 3639L, 583L, 478L, 383L,
3645L, 607L, 413L, 204L, 641L, 583L, 3627L, 179L, 182L, 3736L,
292L, 204L, 323L, 1062L, 2649L, 3639L, 204L, 292L, 111L, 607L,
182L, 3645L, 478L, 413L, 641L, 287L, 577L, 182L, 2649L, 1062L,
383L, 111L, 3736L, 3627L, 75L, 275L), home_rating = c(1546.64167937943,
1534.94287021653, 1514.51852002403, 1558.91823781777, 1555.76784458784,
1518.37707748967, 1464.5264202735, 1642.57388443639, 1447.37725553409,
1420.69724095008, 1428.51535356064, 1512.81896541907, 1463.29314217469,
1492.70306452585, 1404.65235407107, 1418.03767059747, 1420.69724095008,
1532.76811278441, 1642.57388443639, 1515.31896572792, 1498.7997953168,
1555.76784458784, 1492.70306452585, 1519.94395373088, 1418.03767059747,
1546.64167937943, 1463.29314217469, 1447.37725553409, 1514.51852002403,
1558.91823781777, 1428.51535356064, 1546.64167937943, 1524.71735294388,
1534.94287021653, 1484.09023843799, 1519.94395373088, 1420.69724095008,
1447.37725553409, 1492.70306452585, 1555.76784458784, 1418.03767059747,
1524.71735294388, 1420.69724095008, 1463.29314217469, 1518.37707748967,
1464.5264202735, 1515.31896572792, 1512.81896541907, 1514.51852002403,
1534.94287021653), away_rating = c(1555.76784458784, 1484.09023843799,
1524.71735294388, 1532.76811278441, 1519.94395373088, 1546.64167937943,
1484.09023843799, 1404.65235407107, 1534.94287021653, 1514.51852002403,
1418.03767059747, 1524.71735294388, 1515.31896572792, 1498.7997953168,
1555.76784458784, 1514.51852002403, 1463.29314217469, 1464.5264202735,
1518.37707748967, 1524.71735294388, 1558.91823781777, 1420.69724095008,
1404.65235407107, 1484.09023843799, 1512.81896541907, 1464.5264202735,
1534.94287021653, 1428.51535356064, 1532.76811278441, 1418.03767059747,
1464.5264202735, 1512.81896541907, 1642.57388443639, 1514.51852002403,
1404.65235407107, 1555.76784458784, 1515.31896572792, 1463.29314217469,
1518.37707748967, 1447.37725553409, 1519.94395373088, 1404.65235407107,
1532.76811278441, 1428.51535356064, 1498.7997953168, 1642.57388443639,
1484.09023843799, 1558.91823781777, 1546.64167937943, 1492.70306452585
)), .Names = c("home_id", "away_id", "home_rating", "away_rating"
), row.names = c(NA, 50L), class = "data.frame")
Here's what it looks like:
> head(mydata)
home_id away_id home_rating away_rating
1 75 3645 1546.642 1555.768
2 323 3736 1534.943 1484.090
3 607 583 1514.519 1524.717
4 3627 2649 1558.918 1532.768
5 3645 577 1555.768 1519.944
6 641 75 1518.377 1546.642
The columns home_rating and away_rating are scores that reflect how good each team is, and I'd like to use these columns in an apply function. In particular, I have another function named use_ratings() that looks like this:
# takes a rating from the home and away team, as well as an is_cup boolean; returns a score
use_ratings <- function(home_rating, away_rating, is_cup = FALSE) {
  if (is_cup) { # if is_cup, it's a neutral-site game
    rating_diff <- -(home_rating - away_rating) / 400
  } else {
    rating_diff <- -(home_rating + 85 - away_rating) / 400
  }
  W_e <- 1 / (10^(rating_diff) + 1)
  return(W_e)
}
I'd like to apply this function over every row of mydata, using the values in the home_rating and away_rating columns as the parameters passed each time to use_ratings(). How can I do this? Thanks.
Upvotes: 1
Views: 46
Reputation: 3290
@SymbolixAU is absolutely right that the best way to do this (in terms of both speed and readability) is to take advantage of vectorization directly. But if you want to use an "apply function", that function would probably be mapply() or apply():
Using mapply():
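# mydata has no is_cup column, so pass one value via MoreArgs (or a logical vector as is_cup = ...)
mapply(use_ratings, home_rating = mydata$home_rating,
       away_rating = mydata$away_rating, MoreArgs = list(is_cup = FALSE))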
Using apply():
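# apply() hands each row over as a named numeric vector, so index with [[ ]] rather than $; is_cup is missing from mydata, so the default is used
apply(mydata, 1, function(row) use_ratings(row[["home_rating"]], row[["away_rating"]]))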
Multivariate apply (mapply) applies a function of several arguments in parallel over the elements of the objects supplied for those arguments. apply applies a function over the margins of a matrix-like object. Setting MARGIN = 1 asks apply to operate on rows. Hence, we had to wrap use_ratings in a function that takes a row and feeds the relevant values to use_ratings.
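For completeness, here is a minimal sketch of the vectorized route mentioned at the top. It assumes use_ratings() exactly as defined in the question and uses only the first three rows shown in head(mydata). The names games, use_ratings_vec, home_adv and the is_cup values are made up for illustration and are not part of the original code.

```r
# use_ratings() as defined in the question
use_ratings <- function(home_rating, away_rating, is_cup = FALSE) {
  if (is_cup) {
    rating_diff <- -(home_rating - away_rating) / 400
  } else {
    rating_diff <- -(home_rating + 85 - away_rating) / 400
  }
  1 / (10^rating_diff + 1)
}

# First three rows of the question's data, rounded as printed by head(mydata)
games <- data.frame(
  home_rating = c(1546.642, 1534.943, 1514.519),
  away_rating = c(1555.768, 1484.090, 1524.717)
)

# Vectorized call: is_cup stays a single FALSE, the ratings are whole columns
games$W_e <- use_ratings(games$home_rating, games$away_rating)

# Hypothetical variant for a per-game is_cup vector: if() expects a single
# TRUE/FALSE, so the home advantage is picked element-wise with ifelse()
use_ratings_vec <- function(home_rating, away_rating, is_cup = FALSE) {
  home_adv <- ifelse(is_cup, 0, 85)
  1 / (10^(-(home_rating + home_adv - away_rating) / 400) + 1)
}

is_cup <- c(FALSE, TRUE, FALSE)  # made-up values, not from the question
games$W_e_cup <- use_ratings_vec(games$home_rating, games$away_rating, is_cup)
```

The plain call works because every operation inside use_ratings() is already element-wise; only the if() branch requires is_cup to be a single value, which is why the per-game variant swaps it for ifelse().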
Upvotes: 2