Reputation: 117
I'm using a Bradley-Terry model to model the outcome of tennis matches, and have encountered the following error. When I run:
library(BradleyTerry2)
matches <- read.csv("data/matches.csv")
model <- BTm(cbind(wins1, wins2), player1, player2, data = matches)
model
I get the error message:
Error in Diff(player1, player2, formula, id, data, separate.ability, refcat, :
'player1$..' and 'player2$..' must be factors with the same levels
The data frame 'matches' has this format (small reproducible example):
player1 | player2 | wins1 | wins2 |
---|---|---|---|
Agassi | Federer | 0 | 6 |
Agassi | Hewitt | 1 | 0 |
Agassi | Roddick | 1 | 0 |
Federer | Henman | 3 | 1 |
Federer | Hewitt | 9 | 0 |
Federer | Roddick | 5 | 0 |
Henman | Hewitt | 0 | 2 |
Henman | Roddick | 1 | 1 |
Hewitt | Roddick | 3 | 2 |
... and so on. Any name that appears in player1 will appear in player2.
I don't understand why the factors player1 and player2 have different levels. I've tried setting them to factors using as.factor, but that didn't work. I also tried removing data=matches and using matches$wins1 etc. as arguments to the BTm function, but that also didn't work. Now I'm a bit stuck, so any ideas are welcome!! Thank you :)
Reputation: 117
With great help from the comments on the original post, this has now been solved. I just needed to make sure that the player1 and player2 columns contain exactly the same set of players (this might mean switching some around in the data file), and then wrap player1 and player2 in as.factor().
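A quick way to see which names break that condition, sketched in base R on the sample rows from the question (the variable names only_in_player1, only_in_player2 and same_levels are made up for this illustration):

```r
# Sample data copied from the table in the question.
player1 <- c("Agassi", "Agassi", "Agassi", "Federer", "Federer",
             "Federer", "Henman", "Henman", "Hewitt")
player2 <- c("Federer", "Hewitt", "Roddick", "Henman", "Hewitt",
             "Roddick", "Hewitt", "Roddick", "Roddick")

# Names that occur in one column but not in the other.
only_in_player1 <- setdiff(player1, player2)
only_in_player2 <- setdiff(player2, player1)
print(only_in_player1)  # "Agassi"
print(only_in_player2)  # "Roddick"

# as.factor() derives the levels from each column separately, so the two
# level sets agree only when both columns contain the same names.
same_levels <- identical(levels(as.factor(player1)), levels(as.factor(player2)))
print(same_levels)  # FALSE
```

In these sample rows Agassi only ever appears in player1 and Roddick only in player2, which is why as.factor() on its own was not enough.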
Reputation: 76402
Look at what you have in the 1st and 2nd columns.
Factors are internally coded as consecutive integers starting at 1. Below I unclass each of them in order to get their internal representation.
- player1 has two values, 1 and 2;
- player2 has two values, 1 and 2;
- in player1 the level "Rafael Nadal" is the 1st, its value is 1;
- in player2 the level "Rafael Nadal" is the 2nd, its value is 2.
This is because each of the columns is a factor on its own, with no relation to the other column.
lapply(matches[1:2], unclass)
#> $player1
#> [1] 2 2 1
#> attr(,"levels")
#> [1] "Rafael Nadal" "Roger Federer"
#>
#> $player2
#> [1] 2 1 1
#> attr(,"levels")
#> [1] "Andy Murray" "Rafael Nadal"
Created on 2023-11-14
The solution is to get all of the unique values across both player columns and use those unique values as the levels when creating each factor.
In the code that follows, the first instruction gets all unique values as character strings. The second then recreates both columns as factors with those strings as their levels.
lvls <- matches[1:2] |> unlist() |> as.character() |> unique()
matches[1:2] <- lapply(matches[1:2], factor, levels = lvls)
# check that now "Rafael Nadal" is always value 2
lapply(matches[1:2], unclass)
#> $player1
#> [1] 1 1 2
#> attr(,"levels")
#> [1] "Roger Federer" "Rafael Nadal" "Andy Murray"
#>
#> $player2
#> [1] 2 3 3
#> attr(,"levels")
#> [1] "Roger Federer" "Rafael Nadal" "Andy Murray"
matches <- structure(list(
player1 = structure(c(2L, 2L, 1L), levels = c("Rafael Nadal", "Roger Federer"), class = "factor"),
player2 = structure(c(2L, 1L, 1L), levels = c("Andy Murray", "Rafael Nadal"), class = "factor"),
wins1 = c(3L, 5L, 4L), wins2 = c(2L, 2L, 3L)),
class = "data.frame", row.names = c(NA, -3L))
In the meantime the example data set in the question has changed. Except for the references to the players' names, the code above is still valid and solves the problem.
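For reference, here is a minimal sketch of the same fix applied to the sample rows now shown in the question. The data frame is typed in by hand from that table, the pipe is replaced by nested calls so it also runs on older R versions, and the model fit is guarded so it only runs if BradleyTerry2 is installed. No output is shown because I have not run it against the full data file.

```r
# Sample data typed in from the table in the question.
matches <- data.frame(
  player1 = c("Agassi", "Agassi", "Agassi", "Federer", "Federer",
              "Federer", "Henman", "Henman", "Hewitt"),
  player2 = c("Federer", "Hewitt", "Roddick", "Henman", "Hewitt",
              "Roddick", "Hewitt", "Roddick", "Roddick"),
  wins1 = c(0L, 1L, 1L, 3L, 9L, 5L, 0L, 1L, 3L),
  wins2 = c(6L, 0L, 0L, 1L, 0L, 0L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)

# All distinct player names across both columns, in order of first appearance.
lvls <- unique(as.character(unlist(matches[1:2])))

# Recreate both columns as factors that share exactly these levels.
matches[1:2] <- lapply(matches[1:2], factor, levels = lvls)

# Both columns now have identical levels, which is what BTm() checks for.
print(identical(levels(matches$player1), levels(matches$player2)))

# Fit the model from the question, only if the package is available.
if (requireNamespace("BradleyTerry2", quietly = TRUE)) {
  library(BradleyTerry2)
  model <- BTm(cbind(wins1, wins2), player1, player2, data = matches)
  print(model)
}
```

With shared levels the columns do not have to contain the same names row for row; a player who only ever appears in one column simply has a level that goes unused in the other.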