yash

Reputation: 380

How to generate one hot encoded data for a set of columns with similar categories

I gathered a bunch of data from the internet to try to predict a sports outcome, but now I'm confused about how to prepare the data set for training. I have a DataFrame that looks like the following, plus many more columns:

            HomeTeam    AwayTeam        HTR  HF    AF    HomePlayStyle   AwayPlayStyle
Date
2014-08-16  Arsenal     Crystal Palace  D    13.0  19.0  4-1-4-1         4-2-3-1
2014-08-16  Leicester   Everton         A    16.0  10.0  4-4-2 double 6  4-2-3-1
2014-08-16  Man United  Swansea         A    14.0  20.0  3-5-2           3-5-2
2014-08-16  QPR         Hull City       D    10.0  10.0  5-3-2           5-4-1
2014-08-16  Stoke City  Aston Villa     D    14.0  9.0   4-2-3-1         4-3-3 Attacking

My dependent variable (what I need to predict) is HTR (3 categories: D = draw, A = away win, H = home win). Before training I need to prepare the dataset, and I believe I need to use one-hot encoding to turn the columns [HomeTeam, AwayTeam, HomePlayStyle, AwayPlayStyle] into zeros and ones. However, I have a couple of doubts about the approach:

  1. HomePlayStyle and AwayPlayStyle share the same categories. When I one-hot encode them, the same play style (3-5-2 in the third sample, for example) produces two columns, even though technically it is the same category. The same happens with 4-2-3-1, which appears in both columns: pd.get_dummies() creates two columns for it. Would this influence my results? Should I try to merge them, or is there a way around this issue?

  2. For the HomeTeam and AwayTeam columns (I have a few temporal stats for these teams in other numerical columns, but I believe I need to keep the team names in the dataset during training), am I supposed to one-hot encode them? Although this creates two columns for the same team (for instance, HomeTeam_Arsenal and AwayTeam_Arsenal), I think that is an advantage here, since playing at home is very different from playing away, so it shouldn't be an issue. Am I making the right assumptions? Do I even need to one-hot encode this set of columns?

Any thoughts would be really appreciated.

Edit: 3. How do I make sure my algorithm realizes that HomePlayStyle_4-2-3-1 (after getting the dummies) in fact describes the HomeTeam and not the AwayTeam? Is there such a thing as connected columns, so that I could tell which set of columns belongs to the HomeTeam and which belongs to the AwayTeam?

Upvotes: 0

Views: 721

Answers (1)

Juan C

Reputation: 6132

The theory part of your question seems better suited to Cross Validated, so I'll only touch on it briefly. As for one-hot encoding, the easiest way for me is through pandas:

import pandas as pd

categorical_cols = ['HomeTeam', 'AwayTeam', 'HomePlayStyle', 'AwayPlayStyle']
X = pd.get_dummies(df, columns=categorical_cols)  # df is the DataFrame from the question

This will create one column for each possible value in each of those columns with the format {column_name}_{column_value}, so you get columns like HomeTeam_Arsenal.
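To make the naming concrete, here is a minimal sketch on a toy frame built from three rows of the sample in the question. The toy data and the step that separates the target HTR from the features are my own additions for illustration, not something taken from the asker's code.

```python
import pandas as pd

# Toy frame modelled on three rows of the question's sample (illustrative only)
df = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Leicester", "Man United"],
    "AwayTeam": ["Crystal Palace", "Everton", "Swansea"],
    "HTR": ["D", "A", "A"],
    "HomePlayStyle": ["4-1-4-1", "4-4-2 double 6", "3-5-2"],
    "AwayPlayStyle": ["4-2-3-1", "4-2-3-1", "3-5-2"],
})

categorical_cols = ["HomeTeam", "AwayTeam", "HomePlayStyle", "AwayPlayStyle"]

# Keep the target out of the feature matrix, then encode the categorical columns
y = df["HTR"]
X = pd.get_dummies(df.drop(columns="HTR"), columns=categorical_cols)

# Each dummy column carries the name of the column it came from
print(sorted(X.columns))
```

Because the source column name is kept as a prefix, HomePlayStyle_3-5-2 and AwayPlayStyle_3-5-2 stay separate features, so the model can learn a different effect for the same formation depending on which side plays it. The column names themselves are the only "connection" between a dummy and the home or away side.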

Whether these variables cause problems depends on the kind of model you plan to use. Multicollinearity can be a problem in logistic regression, but much less so in, say, random forests. Also, never forget that domain knowledge is very important: if you know teams have different win rates when playing home versus away, you should include that in your model. If you're not sure, test both options. Machine learning is a very iterative process, so don't be afraid to try many options.
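On the multicollinearity point: with a full set of dummies, the columns of each encoded variable sum to 1 in every row, which is exactly collinear with an intercept. One common way to reduce this for linear models, shown here only as a sketch on made-up toy data, is the drop_first parameter of pd.get_dummies, which omits one level per encoded column. Whether it actually helps your model is something to test rather than assume.

```python
import pandas as pd

# Toy data modelled on the question's sample (illustrative only)
df = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Leicester", "Man United", "QPR"],
    "AwayTeam": ["Crystal Palace", "Everton", "Swansea", "Hull City"],
    "HomePlayStyle": ["4-1-4-1", "4-4-2 double 6", "3-5-2", "5-3-2"],
    "AwayPlayStyle": ["4-2-3-1", "4-2-3-1", "3-5-2", "5-4-1"],
})
categorical_cols = list(df.columns)

# Full encoding: one dummy per level, so each group of dummies sums to 1 per row
full = pd.get_dummies(df, columns=categorical_cols)

# Reduced encoding: the first level of each column is dropped and becomes the baseline
reduced = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print(full.shape[1], reduced.shape[1])
```

Tree-based models such as random forests usually do fine with the full encoding, so it is reasonable to keep both versions around and compare them with the same validation scheme.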

Upvotes: 2
