How to generate one hot encoded data for a set of columns with similar categories

Question

I gathered a bunch of data from the internet to try to predict a sports outcome, but now I am confused about how to prepare the dataset for training. I have a DataFrame that looks like the following, plus many more columns:

            HomeTeam    AwayTeam        HTR  HF    AF    HomePlayStyle   AwayPlayStyle
Date
2014-08-16  Arsenal     Crystal Palace  D    13.0  19.0  4-1-4-1         4-2-3-1
2014-08-16  Leicester   Everton         A    16.0  10.0  4-4-2 double 6  4-2-3-1
2014-08-16  Man United  Swansea         A    14.0  20.0  3-5-2           3-5-2
2014-08-16  QPR         Hull City       D    10.0  10.0  5-3-2           5-4-1
2014-08-16  Stoke City  Aston Villa     D    14.0  9.0   4-2-3-1         4-3-3 Attacking

My dependent variable (what I need to predict) is HTR, which has 3 categories: D (draw), A (away win), H (home win). Before training I need to prepare the dataset, and I believe I need to use one-hot encoding to change the columns [HomeTeam, AwayTeam, HomePlayStyle, AwayPlayStyle] into zeros and ones. However, I have a couple of doubts regarding the approach:

1. HomePlayStyle and AwayPlayStyle have similar categories, and when I one-hot encode them, the same play style (3-5-2 in the 3rd sample row) produces two columns even though technically they are the same. 4-2-3-1 is likewise present in both columns, and pd.get_dummies() creates 2 columns for it. Would this influence my results? Should I try to merge them, or is there a way to get around this issue?
2. For the HomeTeam and AwayTeam columns (I have a few temporal stats of these teams in separate numerical columns, but I believe I need to keep the team names in the dataset during training), am I supposed to one-hot encode them? Although this creates two columns for the same team (for instance, HomeTeam_Arsenal and AwayTeam_Arsenal), I think there is an advantage here, since playing at home is very different from playing away, so this shouldn't be an issue. Am I making the right assumptions? Do I even need to one-hot encode this set of columns?
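
For reference, the behaviour described above can be reproduced on a few of the sample rows. The snippet below is only a minimal sketch: it first shows the separate prefixed columns that pd.get_dummies() creates, and then one possible way (an assumption, not necessarily the right modelling choice) to give both columns the same category set, so that the home and away blocks have identical width and order. The names styles, shared, separate and same_width are made up for illustration.

```python
import pandas as pd

# Three rows from the sample above, play-style columns only.
df = pd.DataFrame(
    {
        "HomePlayStyle": ["4-1-4-1", "3-5-2", "4-2-3-1"],
        "AwayPlayStyle": ["4-2-3-1", "3-5-2", "4-3-3 Attacking"],
    }
)

# Default behaviour: every source column gets its own prefixed dummies,
# so "3-5-2" shows up as HomePlayStyle_3-5-2 and AwayPlayStyle_3-5-2.
separate = pd.get_dummies(
    df, columns=["HomePlayStyle", "AwayPlayStyle"], dtype=int
)
print(separate.columns.tolist())

# Sketch of a shared vocabulary: build one category list from both
# columns and cast each column to that categorical dtype. get_dummies
# then emits a column for every category, even ones a side never used.
styles = sorted(set(df["HomePlayStyle"]) | set(df["AwayPlayStyle"]))
shared = df.astype(pd.CategoricalDtype(categories=styles))
same_width = pd.get_dummies(shared, dtype=int)
print(same_width.columns.tolist())
```

In both variants the home and away indicators remain separate columns; the second one only guarantees that each side is encoded against the same list of play styles.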

Any thoughts would be really appreciated.

Edit: 3. How do I make sure my algorithm realizes that HomePlayStyle_4-2-3-1 (after getting the dummies) in fact represents the HomeTeam and not the AwayTeam? Is there such a thing as connected columns, so that I could tell which set of columns belongs to the HomeTeam and which belongs to the AwayTeam?
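
To make the "which columns belong to which side" part concrete, here is a small sketch that assumes the default pd.get_dummies() naming, where each dummy column is prefixed with the name of its source column. The lists home_cols and away_cols are hypothetical helper names. Note that these names are labels for whoever reads the frame; a model fitted on encoded.values only receives the numbers.

```python
import pandas as pd

# Two rows from the sample above, categorical columns only.
df = pd.DataFrame(
    {
        "HomeTeam": ["Arsenal", "Man United"],
        "AwayTeam": ["Crystal Palace", "Swansea"],
        "HomePlayStyle": ["4-1-4-1", "3-5-2"],
        "AwayPlayStyle": ["4-2-3-1", "3-5-2"],
    }
)

# Each dummy column is named "<source column>_<category>".
encoded = pd.get_dummies(
    df,
    columns=["HomeTeam", "AwayTeam", "HomePlayStyle", "AwayPlayStyle"],
    dtype=int,
)

# Group the encoded columns by side using the prefix.
home_cols = [c for c in encoded.columns if c.startswith("Home")]
away_cols = [c for c in encoded.columns if c.startswith("Away")]

print(home_cols)
print(away_cols)
```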
