blastoise

Reputation: 119

Metric for calculating uniformity with penalty for lower values

In my data, I have multiple bill dates and multiple items, and each item is sold in a different quantity each day.

I am looking for a metric to incorporate two things for this data.

  1. The first measures how close together or far apart the dates on which a particular item was purchased are: [0,1,0,0,0,0,1,1] is more uniform and [0,1,1,1,0,0,0,0] is less uniform, where 1 means a purchase was made on that day and 0 means it wasn't.

Note that I have many items like this, so I need a metric by which to rank them.

  2. The second penalizes days on which no purchase is made.

My final aim is a single metric that rewards purchases being spread evenly across the purchase dates and penalizes a low total number of days with purchases.

So far I have tried two methods:

  1. wasserstein_distance, also known as the earth mover's distance. The problem with this metric is that it gives the same value for wasserstein_distance([0,1,0,0,0,0,1,1], [1,1,1,1,1,1,1,1]) and wasserstein_distance([0,1,1,1,0,0,0,0], [1,1,1,1,1,1,1,1]), as the snippet after this list shows. It also doesn't penalize the presence of too many zeros.

  2. Entropy: it suffers from the same penalization problem.
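For reference, here is a minimal reproduction of the Wasserstein problem (assuming scipy.stats.wasserstein_distance, which treats each array as an unordered sample of values rather than as a time series):

from scipy.stats import wasserstein_distance

# both arrays contain five 0s and three 1s, so as samples of values they
# have identical empirical distributions -- the ordering of the days is
# ignored entirely, and both distances come out the same
wasserstein_distance([0,1,0,0,0,0,1,1], [1,1,1,1,1,1,1,1])
0.625

wasserstein_distance([0,1,1,1,0,0,0,0], [1,1,1,1,1,1,1,1])
0.625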

Note that I am also open to using the array of total quantity sold each day instead of the binary representation above.

Upvotes: 0

Views: 178

Answers (1)

StupidWolf

Reputation: 46908

Your question is not very clear, but I guess you are more interested in how spread out the purchases are, assuming a certain rate of success.

Distance measures calculate the overall difference between two vectors, so if the number of successes is approximately the same, you unsurprisingly end up with the same distance.

So in the example you have given, we assume that the expected number of successes is the same. Then we can simply estimate the "waiting" times between purchases:

import numpy as np

ex1 = [0,1,0,0,0,0,1,1]
ex2 = [0,1,1,1,0,0,0,0]

# np.where(...)[0] gives the indices of the purchase days;
# np.diff then gives the gaps between consecutive purchases
np.mean(np.diff(np.where(ex1)[0]))
3.0

np.mean(np.diff(np.where(ex2)[0]))
1.0

So if you have the same number of successes, but the average waiting time is shorter, it is more clustered.

This is essentially treating the purchases as a Poisson-like process of Bernoulli trials. For a Poisson process the waiting times are exponentially distributed, so their coefficient of variation (standard deviation divided by mean) is 1; more regular spacing pushes it below 1 and clustering pushes it above 1. So if you have more data, i.e. a longer vector, with different probabilities of success and different spread, a quick way to judge how spread out the successes are is to measure the dispersion of the times between successes.

Below I simulate two series of purchases whose waiting times are drawn from different gamma distributions, one regular and one bursty:

np.random.seed(999)

# waiting times ~ Gamma(shape=4, scale=1): mean gap 4, fairly regular
ex1 = np.zeros(500)
ex1[np.cumsum(np.random.gamma(4, 1, 123)).astype(int)] = 1

# waiting times ~ Gamma(shape=1.25, scale=6): mean gap 7.5, highly variable
ex2 = np.zeros(500)
ex2[np.cumsum(np.random.gamma(1.25, 6, 68)).astype(int)] = 1

You can see that ex1 is more evenly spread, i.e. less clustered, than ex2:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))

# np.where returns a tuple; take [0] to get the array of event positions
idx1 = np.where(ex1)[0]
idx2 = np.where(ex2)[0]
ax.plot(idx1, np.ones_like(idx1), '|', color='k')
ax.plot(idx2, 2 * np.ones_like(idx2), '|', color='b')
plt.show()

[Figure: event plot of the purchase positions of ex1 (top, black) and ex2 (bottom, blue); ex2's events are visibly more clustered.]

We can calculate the coefficient of variation of the waiting times, and ex2 has the higher value:

times_1 = np.diff(np.where(ex1)[0])   # gaps between consecutive purchases
np.std(times_1) / np.mean(times_1)
0.5221040055320324

times_2 = np.diff(np.where(ex2)[0])
np.std(times_2) / np.mean(times_2)
0.8645205800519346
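If you also want the penalty from the second requirement in the question, one possibility is to combine the gap dispersion above with the fraction of days on which a purchase was made. The purchase_score helper below is a hypothetical sketch of such a combination, not an established metric; how the two parts are weighted is entirely an assumption:

import numpy as np

def purchase_score(days):
    # Hypothetical combined score: higher = purchases are more evenly
    # spread AND occur on more days.
    days = np.asarray(days)
    idx = np.where(days)[0]               # indices of purchase days
    if len(idx) < 3:                      # too few purchases to measure dispersion
        return 0.0
    gaps = np.diff(idx)
    cv = np.std(gaps) / np.mean(gaps)     # 0 = perfectly regular spacing
    frequency = len(idx) / len(days)      # fraction of days with a purchase
    return frequency / (1.0 + cv)         # penalizes clustering and sparsity

# on the simulated series above, the more regular ex1 outranks ex2
purchase_score(ex1) > purchase_score(ex2)
True

Note that gap statistics only become meaningful once an item has a reasonable number of purchases; for series as short as the 8-day examples in the question, the coefficient of variation of the gaps is very noisy.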

Upvotes: 0
