Ollie

Reputation: 117

Factors have different levels but I don't understand why

I'm using a Bradley-Terry model to model the outcome of tennis matches, and have encountered the following error. When I run:

library(BradleyTerry2)

matches <- read.csv("data/matches.csv")
model <- BTm(cbind(wins1, wins2), player1, player2, data = matches)
model

I get the error message:

Error in Diff(player1, player2, formula, id, data, separate.ability, refcat,  : 
  'player1$..' and 'player2$..' must be factors with the same levels

The data frame 'matches' has this format (small reproducible example):

player1 player2 wins1 wins2
Agassi Federer 0 6
Agassi Hewitt 1 0
Agassi Roddick 1 0
Federer Henman 3 1
Federer Hewitt 9 0
Federer Roddick 5 0
Henman Hewitt 0 2
Henman Roddick 1 1
Hewitt Roddick 3 2

... and so on. Any name that appears in player1 will appear in player2.

I don't understand why the factors player1 and player2 have different levels. I've tried converting them to factors with as.factor, but that didn't work. I also tried removing data=matches and passing matches$wins1 etc. as arguments to the BTm function, but that didn't work either. Now I'm a bit stuck, so any ideas are welcome! Thank you :)

Upvotes: 0

Views: 103

Answers (2)

Ollie

Reputation: 117

With great help from the comments on the original post, this is now solved. I just needed to make sure that the player1 and player2 columns contain the same set of players (this might mean swapping some pairs around in the data file), and then wrap player1 and player2 in as.factor().
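
To see which players break the "same set of players" condition, a quick check like the one below may help. It is only a sketch: the data frame is typed in by hand from the example rows in the question, and the names only_in_p1, only_in_p2 and same_levels are made up for illustration.

```r
# Example rows copied by hand from the question
matches <- data.frame(
  player1 = c("Agassi", "Agassi", "Agassi", "Federer", "Federer",
              "Federer", "Henman", "Henman", "Hewitt"),
  player2 = c("Federer", "Hewitt", "Roddick", "Henman", "Hewitt",
              "Roddick", "Hewitt", "Roddick", "Roddick"),
  wins1 = c(0, 1, 1, 3, 9, 5, 0, 1, 3),
  wins2 = c(6, 0, 0, 1, 0, 0, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Players that appear in only one of the two columns
only_in_p1 <- setdiff(matches$player1, matches$player2)
only_in_p2 <- setdiff(matches$player2, matches$player1)
only_in_p1
#> [1] "Agassi"
only_in_p2
#> [1] "Roddick"

# as.factor() builds the levels from each column separately,
# so on these rows the two level sets are not the same
same_levels <- identical(levels(as.factor(matches$player1)),
                         levels(as.factor(matches$player2)))
same_levels
#> [1] FALSE
```

On these rows, as.factor() alone is not enough: "Agassi" never appears in player2 and "Roddick" never appears in player1, which is why some pairs had to be swapped before it worked.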

Upvotes: 0

Rui Barradas

Reputation: 76402

Look at what you have in the 1st and 2nd columns.
Factors are internally coded as consecutive integers starting at 1. Below I unclass each of them in order to get their internal representation.

  • player1 has two values, 1 and 2;
  • player2 has two values, 1 and 2;
  • in player1 the level "Rafael Nadal" is the 1st, its value is 1;
  • but in player2 the level "Rafael Nadal" is the 2nd, its value is 2.

This is because each of the columns is a factor on its own, with no relation to the other column.

lapply(matches[1:2], unclass)
#> $player1
#> [1] 2 2 1
#> attr(,"levels")
#> [1] "Rafael Nadal"  "Roger Federer"
#> 
#> $player2
#> [1] 2 1 1
#> attr(,"levels")
#> [1] "Andy Murray"  "Rafael Nadal"

Created on 2023-11-14

The solution is to collect the unique values from both player columns and use them as the levels when creating each factor.

In the code that follows, the first instruction gets all unique values as character strings. The second recreates both factor columns with those strings as their levels.

lvls <- matches[1:2] |> unlist() |> as.character() |> unique()
matches[1:2] <- lapply(matches[1:2], factor, levels = lvls)

# check that now "Rafael Nadal" is always value 2
lapply(matches[1:2], unclass)
#> $player1
#> [1] 1 1 2
#> attr(,"levels")
#> [1] "Roger Federer" "Rafael Nadal"  "Andy Murray"  
#> 
#> $player2
#> [1] 2 3 3
#> attr(,"levels")
#> [1] "Roger Federer" "Rafael Nadal"  "Andy Murray"

Data

matches <- structure(list(
  player1 = structure(c(2L, 2L, 1L), levels = c("Rafael Nadal", "Roger Federer"), class = "factor"), 
  player2 = structure(c(2L, 1L, 1L), levels = c("Andy Murray", "Rafael Nadal"), class = "factor"), 
  wins1 = c(3L, 5L, 4L), wins2 = c(2L, 2L, 3L)), 
  class = "data.frame", row.names = c(NA, -3L))

Edit

In the meantime the example data set in the question has changed. Except for the references to the players' names, the code above is still valid and solves the problem.
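
For illustration, here is the same idea applied to the rows now shown in the question. This is a sketch under some assumptions: the data frame is retyped by hand from the question, the pipe is replaced by nested calls so it also runs on R versions before 4.1, and the model is fitted only if BradleyTerry2 happens to be installed.

```r
# Example rows copied by hand from the question
matches <- data.frame(
  player1 = c("Agassi", "Agassi", "Agassi", "Federer", "Federer",
              "Federer", "Henman", "Henman", "Hewitt"),
  player2 = c("Federer", "Hewitt", "Roddick", "Henman", "Hewitt",
              "Roddick", "Hewitt", "Roddick", "Roddick"),
  wins1 = c(0, 1, 1, 3, 9, 5, 0, 1, 3),
  wins2 = c(6, 0, 0, 1, 0, 0, 2, 1, 2),
  stringsAsFactors = FALSE
)

# One shared set of levels, taken from both player columns
lvls <- unique(as.character(unlist(matches[1:2])))
matches[1:2] <- lapply(matches[1:2], factor, levels = lvls)

# Both columns now carry identical levels
identical(levels(matches$player1), levels(matches$player2))
#> [1] TRUE

# Fit the model from the question, only if the package is available
if (requireNamespace("BradleyTerry2", quietly = TRUE)) {
  library(BradleyTerry2)
  model <- BTm(cbind(wins1, wins2), player1, player2, data = matches)
  print(model)
}
```

Passing levels = lvls explicitly means no rows have to be swapped in the data file, since a player who appears in only one column still gets a level in both factors.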

Upvotes: 0
