Optimize Sum of Squared Differences (SSD) in numpy

Question

I'm trying to optimise expected goals (xG) in football (soccer) matches by measuring the sum of squared differences (SSD) against the individual match timeslots. Assume each match is divided into k timeslots, each with constant probabilities of a goal scored by either team or of no goal.

**Sample SSD for individual match_i with Final score [0-0]**
xG is unique to each match.
Team1 and Team2 have the following xG, multiplied by an arbitrary multiplier M.

Team1 = xG_1*M
Team2 = xG_2*M
prob_1 = [1-(xG_1 + xG_2)/k, xG_1/k, xG_2/k].

where prob_1 holds the constant probabilities of No Goal, a Team1 goal or a Team2 goal in each of the k timeslots of match_i, so that sum(prob_1) = 1.

To measure the SSD for match_i:

import numpy as np

x1 = [1,0,0] # prob. of No goal scored per timeslot.
x2 = [0,1,0] # prob. of Home Team scoring per timeslot.
x3 = [0,0,1] # prob. of Away Team scoring per timeslot.
y  = np.array([1-(xG_1 + xG_2)/k, xG_1/k, xG_2/k])
#    Using xG_Team1 and xG_Team2 from table below.

total_timeslot = 180
Home_Goal = [] # No Goal scored
Away_Goal = [] # No Goal scored

def sum_squared_diff(x1, x2, x3, y):
    ssd=[]
    for k in range(total_timeslot):
        if k in Home_Goal:
            ssd.append( sum((x2 - y)**2))
        elif k in Away_Goal:
            ssd.append(sum((x3 - y)**2))
        else:
            ssd.append(sum((x1 - y)**2))
    return ssd


SSD_Result =  sum_squared_diff(x1, x2, x3, y)
sum(SSD_Result)
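
Written out in symbols, and assuming exactly one outcome per timeslot, the loop above accumulates the quantity below for match $i$. Here $o_t$ is the outcome vector (x1, x2 or x3) observed in slot $t$ and $y_i$ is that match's constant probability vector. The counts $n_0$, $n_1$, $n_2$ (slots with no goal, a Home goal, an Away goal) are notation introduced here for illustration, not names from the code.

```latex
$$
\mathrm{SSD}_i
  = \sum_{t=1}^{k} \sum_{j=1}^{3} \left(o_{t,j} - y_{i,j}\right)^2
  = n_0 \lVert x_1 - y_i \rVert^2
  + n_1 \lVert x_2 - y_i \rVert^2
  + n_2 \lVert x_3 - y_i \rVert^2,
\qquad n_0 + n_1 + n_2 = k .
$$
```

Because $y_i$ does not change within a match, the total depends only on how many slots ended in each outcome, not on when the goals were scored.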

For example, using the xG values at index 0 from the table below and M = 1:

First, for k = 187 timeslots, the xG per timeslot becomes 1.4405394105672238/187 and 1.3800950382265837/187,
and both are constant throughout the match.
y_0  = np.array([1-(0.007703419308 + 0.007380187370)/187, 0.007703419308/187, 0.007380187370/187])
Using y_0 in the function above, 
SSD_Result for xG at index 0 is  1.8252675137316426e-06.

As SSDs go this looks promising, but then again the match ended goalless and the two teams have almost identical xG figures. Now I want to apply the same procedure to xG index 1, xG index 2, ..., xG index 10000, then take the total SSD and, depending on its value, change the arbitrary multiplier M until the best result is achieved.
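
A minimal sketch of that search, not a definitive method. It assumes M scales both xG values before they are divided by k, and it relies on the per-slot probabilities being constant, so that a match's SSD depends only on how many slots ended in each outcome. `total_ssd`, `candidates` and the goal counts for matches 1 and 2 are hypothetical names and placeholder values; only the xG figures and the 0-0 result at index 0 come from this post.

```python
import numpy as np


def total_ssd(M, xg, goals, k):
    """Total SSD over all matches for multiplier M.

    xg    : (n, 2) array of xG_Team1, xG_Team2
    goals : (n, 2) array of goals scored by Team1, Team2
    k     : number of timeslots per match
    """
    goal_prob = xg * M / k                                         # (n, 2)
    y = np.column_stack([1.0 - goal_prob.sum(axis=1), goal_prob])  # (n, 3)
    # dist[i, j] = squared distance between outcome vector j and y[i]
    dist = ((np.eye(3)[None, :, :] - y[:, None, :]) ** 2).sum(axis=2)
    # counts[i, j] = number of slots of match i that ended in outcome j
    counts = np.column_stack([k - goals.sum(axis=1), goals])
    return float((counts * dist).sum())


xg = np.array([[1.440539, 1.380095],
               [2.123673, 0.946116],
               [1.819697, 0.921660]])
# Placeholder final scores: only the 0-0 for match 0 is from the question.
goals = np.array([[0, 0], [2, 1], [1, 1]])
k = 187

# Scan a coarse grid of multipliers and keep the one with the lowest total SSD.
candidates = np.linspace(0.5, 1.5, 101)
losses = np.array([total_ssd(M, xg, goals, k) for M in candidates])
best_M = candidates[losses.argmin()]
```

A finer grid, or a scalar minimiser, could replace the scan once the objective behaves as expected.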

**Question**

How can I convert the xG of each match into a prob_1-like array (i.e. prob_1 ... prob_10000) and pass it to the function above?
Here's a sample of the xG data.

individual_match_xG.tail()
     xG_Team1  xG_Team2
0  1.440539  1.380095
1  2.123673  0.946116
2  1.819697  0.921660
3  1.132676  1.375717
4  1.244837  1.269933

So in conclusion,

* There are 10000 final scores with xG that I want to turn into 10000 prob_1 arrays, then get an SSD for each.
* k is the total number of timeslots per match and is constant for a given interval length. For 30-second timeslots, k is 180. With 7/2 minutes of injury time added, k = 187.
* Home_Goal, Away_Goal and No_Goal represent the probability of a single goal being scored per timeslot by the respective team, or of no goal being scored.
* Only one goal can be scored per timeslot.
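
The timeslot arithmetic in the list above can be sketched as follows. `timeslot_count` and `slot_index` are hypothetical helper names, and the goal time is made up purely to show how a clock time could map to an entry in Home_Goal, assuming zero-based slots.

```python
def timeslot_count(slot_seconds=30.0, regular_minutes=90.0, injury_minutes=0.0):
    """Number of timeslots k for a match of the given length."""
    return int(round((regular_minutes + injury_minutes) * 60.0 / slot_seconds))


def slot_index(minute, second, slot_seconds=30.0):
    """Zero-based timeslot containing a goal scored at minute:second."""
    return int((minute * 60 + second) // slot_seconds)


k_regular = timeslot_count()                        # 30-second slots, no injury time
k_with_injury = timeslot_count(injury_minutes=3.5)  # 7/2 minutes of injury time
Home_Goal = [slot_index(23, 40)]                    # made-up goal time, illustration only
```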

wwii · Accepted Answer

import numpy as np
# constants
M = 1.0
k = 180    # number of timeslots
x1 = [1,0,0] # prob. of No goal scored per timeslot.
x2 = [0,1,0] # prob. of Home Team scoring per timeslot.
x3 = [0,0,1] # prob. of Away Team scoring per timeslot.    

# seven scores
final_scores = [[2,1],[3,3],[1,2],[1,1],[2,1],[4,0],[2,3]]

# time slots with goals
Home_Goal = [2, 3]
Away_Goal = [4]

# numpy arrays of the data
final_scores = np.array(final_scores)    # team_1 is [:,0], team_2 is [:,1]
home_goal = np.array(Home_Goal)
away_goal = np.array(Away_Goal)

# fudge factor
adj_scores = final_scores * M    # shape --> (# of scores, 2)

# calculate prob_1
slot_goal_probability = adj_scores / k    # xG_n / k
slot_draw_probability = 1 - slot_goal_probability.sum(axis = 1)    #1-(xG_1+xG_2)/k

# y for all scores
y = np.concatenate((slot_draw_probability[:,None], slot_goal_probability), axis=1)


# ssd for x2, x3, x1
home_ssd = np.sum(np.square(x2 - y), axis=1)
away_ssd = np.sum(np.square(x3 - y), axis=1)
draw_ssd = np.sum(np.square(x1 - y), axis=1)

ssd = np.zeros((y.shape[0],k))
ssd += draw_ssd[:,None]    # all time slices a draw
ssd[:,home_goal] = home_ssd[:,None]    # time slots with goal for home games 
ssd[:,away_goal] = away_ssd[:,None]    # time slots with goal for away games

Sum of probabilities (prob_1 in your example) for each score:

>>> y.sum(axis=1)
array([1., 1., 1., 1., 1., 1., 1.])

ssd's shape is (# of scores, 180); it holds the per-timeslot squared difference for every score.

>>> ssd.sum(axis=1)
array([5.92222222, 6.        , 5.93333333, 5.93333333, 5.92222222,
       5.95555556, 5.96666667])
>>> for thing in ssd.sum(axis=1):
    print(thing)

5.922222222222222
6.000000000000001
5.933333333333332
5.933333333333337
5.922222222222222
5.955555555555557
5.966666666666663
>>>

Test y with your function:

>>> y
array([[0.98333333, 0.01111111, 0.00555556],
       [0.96666667, 0.01666667, 0.01666667],
       [0.98333333, 0.00555556, 0.01111111],
       [0.98888889, 0.00555556, 0.00555556],
       [0.98333333, 0.01111111, 0.00555556],
       [0.97777778, 0.02222222, 0.        ],
       [0.97222222, 0.01111111, 0.01666667]])
>>> for prob in y:
    print(sum(sum_squared_diff(x1, x2, x3, prob)))

5.922222222222252
6.000000000000045
5.933333333333363
5.933333333333391
5.922222222222252
5.955555555555599
5.966666666666613
>>>

There are some minor differences, on the order of 1e-14, which I'll put down to floating-point rounding.
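
One way to check that the two sets of results agree within rounding is a tolerance-based comparison rather than exact equality. The sketch below copies the printed values from the two runs above; the names `vectorized` and `looped` and the tolerance of 1e-12 are choices made for this illustration.

```python
import numpy as np

# Values printed by ssd.sum(axis=1) and by the original loop-based function.
vectorized = np.array([5.922222222222222, 6.000000000000001, 5.933333333333332,
                       5.933333333333337, 5.922222222222222, 5.955555555555557,
                       5.966666666666663])
looped = np.array([5.922222222222252, 6.000000000000045, 5.933333333333363,
                   5.933333333333391, 5.922222222222252, 5.955555555555599,
                   5.966666666666613])

# Largest absolute gap between the two approaches.
max_gap = np.abs(vectorized - looped).max()

# Compare with an explicit absolute tolerance instead of ==.
same_within_rounding = np.allclose(vectorized, looped, rtol=0.0, atol=1e-12)
```

The built-in sum and np.sum add the terms in a different order, which is a plausible source of gaps this small.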

Maybe someone will see this and fine tune it a bit with more optimizations in their own answer. Once I worked it out I didn't look for further improvements.
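
One possible refinement, sketched under the assumption that every match comes with its own list of goal slots (the code above reuses the same Home_Goal and Away_Goal for all scores). The names `home_slots`, `away_slots` and `outcome` are hypothetical, and the slots for the second match are made up for illustration.

```python
import numpy as np

k = 180
final_scores = np.array([[2, 1], [1, 2]])
goal_prob = final_scores / k
y = np.column_stack([1 - goal_prob.sum(axis=1), goal_prob])  # (n, 3)

# One list of goal slots per match.
home_slots = [[2, 3], [10]]
away_slots = [[4], [20, 90]]

# outcome[i, t] is 0 (no goal), 1 (home goal) or 2 (away goal).
outcome = np.zeros((len(y), k), dtype=int)
for i, (home, away) in enumerate(zip(home_slots, away_slots)):
    outcome[i, np.asarray(home, dtype=int)] = 1
    outcome[i, np.asarray(away, dtype=int)] = 2

# Look up the one-hot vector for every slot, then broadcast against y.
one_hot = np.eye(3)[outcome]                          # (n, k, 3)
ssd = np.square(one_hot - y[:, None, :]).sum(axis=2)  # (n, k)
per_match = ssd.sum(axis=1)                           # (n,)
```

The first match uses the same score and goal slots as the first row above, so its total should agree with the first value printed by ssd.sum(axis=1).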

Numpy Basics:
Indexing
Broadcasting
