Canovice

Reputation: 10163

Which apply function in R to use for my calculations

I have a dataframe where each row in the data represents a matchup in a soccer game. Here is a summary with some columns removed, and only for 50 games of a season:

dput(mydata)
structure(list(home_id = c(75L, 323L, 607L, 3627L, 3645L, 641L, 
204L, 111L, 287L, 179L, 1062L, 292L, 413L, 275L, 182L, 3639L, 
179L, 2649L, 111L, 478L, 383L, 3645L, 275L, 577L, 3639L, 75L, 
413L, 287L, 607L, 3627L, 1062L, 75L, 583L, 323L, 3736L, 577L, 
179L, 287L, 275L, 3645L, 3639L, 583L, 179L, 413L, 641L, 204L, 
478L, 292L, 607L, 323L), away_id = c(3645L, 3736L, 583L, 2649L, 
577L, 75L, 3736L, 182L, 323L, 607L, 3639L, 583L, 478L, 383L, 
3645L, 607L, 413L, 204L, 641L, 583L, 3627L, 179L, 182L, 3736L, 
292L, 204L, 323L, 1062L, 2649L, 3639L, 204L, 292L, 111L, 607L, 
182L, 3645L, 478L, 413L, 641L, 287L, 577L, 182L, 2649L, 1062L, 
383L, 111L, 3736L, 3627L, 75L, 275L), home_rating = c(1546.64167937943, 
1534.94287021653, 1514.51852002403, 1558.91823781777, 1555.76784458784, 
1518.37707748967, 1464.5264202735, 1642.57388443639, 1447.37725553409, 
1420.69724095008, 1428.51535356064, 1512.81896541907, 1463.29314217469, 
1492.70306452585, 1404.65235407107, 1418.03767059747, 1420.69724095008, 
1532.76811278441, 1642.57388443639, 1515.31896572792, 1498.7997953168, 
1555.76784458784, 1492.70306452585, 1519.94395373088, 1418.03767059747, 
1546.64167937943, 1463.29314217469, 1447.37725553409, 1514.51852002403, 
1558.91823781777, 1428.51535356064, 1546.64167937943, 1524.71735294388, 
1534.94287021653, 1484.09023843799, 1519.94395373088, 1420.69724095008, 
1447.37725553409, 1492.70306452585, 1555.76784458784, 1418.03767059747, 
1524.71735294388, 1420.69724095008, 1463.29314217469, 1518.37707748967, 
1464.5264202735, 1515.31896572792, 1512.81896541907, 1514.51852002403, 
1534.94287021653), away_rating = c(1555.76784458784, 1484.09023843799, 
1524.71735294388, 1532.76811278441, 1519.94395373088, 1546.64167937943, 
1484.09023843799, 1404.65235407107, 1534.94287021653, 1514.51852002403, 
1418.03767059747, 1524.71735294388, 1515.31896572792, 1498.7997953168, 
1555.76784458784, 1514.51852002403, 1463.29314217469, 1464.5264202735, 
1518.37707748967, 1524.71735294388, 1558.91823781777, 1420.69724095008, 
1404.65235407107, 1484.09023843799, 1512.81896541907, 1464.5264202735, 
1534.94287021653, 1428.51535356064, 1532.76811278441, 1418.03767059747, 
1464.5264202735, 1512.81896541907, 1642.57388443639, 1514.51852002403, 
1404.65235407107, 1555.76784458784, 1515.31896572792, 1463.29314217469, 
1518.37707748967, 1447.37725553409, 1519.94395373088, 1404.65235407107, 
1532.76811278441, 1428.51535356064, 1498.7997953168, 1642.57388443639, 
1484.09023843799, 1558.91823781777, 1546.64167937943, 1492.70306452585
)), .Names = c("home_id", "away_id", "home_rating", "away_rating"
), row.names = c(NA, 50L), class = "data.frame")

Here's what it looks like:

> head(mydata)
  home_id away_id home_rating away_rating
1      75    3645    1546.642    1555.768
2     323    3736    1534.943    1484.090
3     607     583    1514.519    1524.717
4    3627    2649    1558.918    1532.768
5    3645     577    1555.768    1519.944
6     641      75    1518.377    1546.642

The columns home_rating and away_rating are scores that reflect how good each team is, and I'd like to use these columns in an apply function. In particular, I have another function named use_ratings() that looks like this:

# takes a rating from home and away team, as well as is_cup boolean, returns score
use_ratings <- function(home_rating, away_rating, is_cup = FALSE) {
  if (is_cup) { # if is_cup, it's a neutral-site game (no home advantage)
    rating_diff <- -(home_rating - away_rating) / 400
  } else {
    rating_diff <- -(home_rating + 85 - away_rating) / 400
  }

  W_e <- 1 / (10^(rating_diff) + 1) 
  return(W_e)
} 

I'd like to apply this function over every row of mydata, using the values in the home_rating and away_rating columns as the arguments passed to use_ratings() each time. How can I do this? Thanks!

Upvotes: 1

Views: 46

Answers (1)

nothing

Reputation: 3290

@SymbolixAU is absolutely right that the best way to do this (in terms of both speed and readability) is to take advantage of vectorization directly. But if you do want to use an "apply function", it would probably be mapply() or apply():

Using mapply():

mapply(use_ratings, home_rating = mydata$home_rating, away_rating = mydata$away_rating, is_cup = <a vector of booleans>)

Using apply():

apply(mydata, 1, function(row) use_ratings(row[["home_rating"]], row[["away_rating"]]))  # is_cup is missing from mydata, so the default (FALSE) is used

Multivariate apply (mapply) applies a function of several arguments to several objects in parallel, matching the first elements of each, then the second elements, and so on. apply applies a function over the margins of a matrix-like object. Setting MARGIN = 1 asks apply to operate on rows. apply passes each row to the function as a vector, so we wrap use_ratings in an anonymous function that picks out the relevant elements and feeds them to use_ratings.
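For comparison, here is a minimal sketch of all three approaches side by side. It uses only the first three rows of mydata from the question, and the is_cup vector is a made-up placeholder (the question's data has no such column), so treat it as an illustration rather than a drop-in solution:

```r
# use_ratings() as defined in the question
use_ratings <- function(home_rating, away_rating, is_cup = FALSE) {
  if (is_cup) {
    rating_diff <- -(home_rating - away_rating) / 400
  } else {
    rating_diff <- -(home_rating + 85 - away_rating) / 400
  }
  1 / (10^(rating_diff) + 1)
}

# First three rows of the question's data (rating columns only)
mydata <- data.frame(
  home_rating = c(1546.64167937943, 1534.94287021653, 1514.51852002403),
  away_rating = c(1555.76784458784, 1484.09023843799, 1524.71735294388)
)

# 1. Direct vectorized call: the arithmetic works on whole columns at once
vec <- use_ratings(mydata$home_rating, mydata$away_rating)

# 2. mapply(): one call per row; is_cup here is a hypothetical all-FALSE vector
is_cup <- rep(FALSE, nrow(mydata))
map <- mapply(use_ratings, mydata$home_rating, mydata$away_rating, is_cup)

# 3. apply() over rows: each row arrives as a named numeric vector
app <- apply(mydata, 1, function(row) {
  use_ratings(row[["home_rating"]], row[["away_rating"]])
})

# All three should agree
all.equal(vec, map)
all.equal(vec, unname(app))
```

One thing to keep in mind: the direct call only works because is_cup is a single value. if() expects a single TRUE/FALSE, so if is_cup varies by row you would either use mapply() as above or rewrite the branch with something like ifelse().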

Upvotes: 2
