Reputation: 119
In my data I have multiple bill dates and multiple items, and each item sells a different amount each day.
I am looking for a metric that incorporates two things about this data. For example,
[0,1,0,0,0,0,1,1]
is more uniform and
[0,1,1,1,0,0,0,0]
is less uniform, where 1 means a purchase was made on that day and 0 means it wasn't. Note that I have many items like these, so I need a metric to rank them.
My final aim is a metric under which purchases spread evenly across the purchase dates score best, while a low total number of purchase days is penalized.
I have tried two methods so far:
wasserstein_distance, also known as earth mover's distance.
The problem with this metric is that it gives the same value for
wasserstein_distance([0,1,0,0,0,0,1,1], [1,1,1,1,1,1,1,1])
and
wasserstein_distance([0,1,1,1,0,0,0,0], [1,1,1,1,1,1,1,1]).
It also doesn't penalize the presence of too many zeros (see the check below).
Entropy: same penalization problem.
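For reference, here is a quick check of both problems (my own verification with scipy.stats; both vectors contain the same multiset of values, five 0s and three 1s, which is why the results match):

from scipy.stats import wasserstein_distance, entropy

a = [0,1,0,0,0,0,1,1]
b = [0,1,1,1,0,0,0,0]
ones = [1,1,1,1,1,1,1,1]

# Same multiset of values -> same empirical distribution -> same distance
print(wasserstein_distance(a, ones))  # 0.625
print(wasserstein_distance(b, ones))  # 0.625

# entropy() normalizes the vector to a probability mass function; three
# equal spikes give log(3) no matter where they sit or how many zeros
# surround them
print(entropy(a), entropy(b))  # both ~1.0986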
Note that I am also open to using the array of total quantity sold each day instead of the binary representation above.
Upvotes: 0
Views: 178
Reputation: 46908
Your question is not entirely clear, but I guess you are more interested in how spread out the purchases are, assuming a certain rate of success.
Distance measures calculate the overall difference between two vectors, so if the number of successes is approximately the same, you end up with the same distance, unsurprisingly.
So in the example you have given, we assume that the expected number of successes is the same. Then we can simply estimate the "waiting" times between purchases:
import numpy as np

ex1 = [0,1,0,0,0,0,1,1]
ex2 = [0,1,1,1,0,0,0,0]

# Mean gap between consecutive purchase days
np.mean(np.diff(np.where(ex1)[0]))
# 3.0
np.mean(np.diff(np.where(ex2)[0]))
# 1.0
So with the same number of successes, a shorter average waiting time means the successes are more clustered.
This is essentially a Poisson-process view of Bernoulli trials. If you have more data, i.e. a longer vector, with different probabilities of success and different spreads, a quick way to judge how spread out the successes are is to measure the dispersion of the times between successes.
Below I simulate two series of events over the same 500-day window, with different gap distributions:
np.random.seed(999)

# Series 1: gaps drawn from gamma(shape=4, scale=1) -- relatively regular
ex1 = np.zeros(500)
ex1[np.cumsum(np.random.gamma(4, 1, 123)).astype(int)] = 1

# Series 2: gaps drawn from gamma(shape=1.25, scale=6) -- much burstier
ex2 = np.zeros(500)
ex2[np.cumsum(np.random.gamma(1.25, 6, 68)).astype(int)] = 1
You can see that ex1 is more evenly spread, i.e. less clustered, than ex2:
import matplotlib.pyplot as plt

idx1 = np.where(ex1)[0]
idx2 = np.where(ex2)[0]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(idx1, np.ones_like(idx1), '|', color='k')    # events of ex1 at y=1
ax.plot(idx2, np.full_like(idx2, 2), '|', color='b') # events of ex2 at y=2
plt.show()
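As an aside, matplotlib's ax.eventplot draws this kind of raster directly (an alternative sketch, equivalent to the plot above):

# Same raster using matplotlib's built-in event plot
fig, ax = plt.subplots(figsize=(8, 4))
ax.eventplot([idx1, idx2], lineoffsets=[1, 2], colors=['k', 'b'])
plt.show()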
We can calculate the coefficient of variation of the waiting times, and ex2 has the higher (more clustered) value:
times_1 = np.diff(idx1)
np.std(times_1) / np.mean(times_1)
# 0.5221040055320324

times_2 = np.diff(idx2)
np.std(times_2) / np.mean(times_2)
# 0.8645205800519346
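To turn this into a ranking over many items, here is a minimal sketch; dispersion_score is a hypothetical helper of my own, and treating items with fewer than three purchase days as worst-ranked is an assumption, not part of the original answer:

# Sketch: rank items by the coefficient of variation of their purchase gaps
def dispersion_score(purchases):
    days = np.where(np.asarray(purchases))[0]
    if len(days) < 3:       # too few gaps to measure dispersion;
        return np.inf       # rank such sparse items last (my assumption)
    gaps = np.diff(days)
    return np.std(gaps) / np.mean(gaps)

items = {"item_a": ex1, "item_b": ex2}
# Lower score = more evenly spread purchases
print(sorted(items, key=lambda k: dispersion_score(items[k])))
# ['item_a', 'item_b']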
Upvotes: 0