Reputation: 659
I'm trying to develop a very simple machine learning example that recognizes similarity between arrays. To do this, I'm trying to calculate the element-wise average of two arrays that have different lengths.
For example if I have:
array_1 = [0, 4, 5];
array_2 = [4, 2, 7];
The average is:
average_array = [2, 3, 6];
But how can I manage to calculate the average if I have the following situation:
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
As you can see, the arrays have different lengths. Is there an algorithm I can apply to solve this problem? Does anyone have an idea or a suggestion?
Of course I can consider the missing values of the second array as 0, and evaluate the average as, for example:
average_array = [2, 3, 6, 5, 3.5];
or consider the values as "null" and have:
average_array = [2, 3, 6, 10, 7];
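To make the two options concrete, here is a minimal sketch of both. Python is an assumption (the snippets above don't name a language), and the function names `average_zero_padded` and `average_ignore_missing` are made up for illustration only.

```python
from itertools import zip_longest


def average_zero_padded(a, b):
    """Element-wise average, treating missing entries as 0."""
    return [(x + y) / 2 for x, y in zip_longest(a, b, fillvalue=0)]


def average_ignore_missing(a, b):
    """Element-wise average, skipping missing ("null") entries."""
    result = []
    for x, y in zip_longest(a, b, fillvalue=None):
        # Keep only the values that actually exist at this position.
        present = [v for v in (x, y) if v is not None]
        result.append(sum(present) / len(present))
    return result


array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]

print(average_zero_padded(array_1, array_2))     # [2.0, 3.0, 6.0, 5.0, 3.5]
print(average_ignore_missing(array_1, array_2))  # [2.0, 3.0, 6.0, 10.0, 7.0]
```

The first version pulls the extra positions toward 0, while the second just copies the longer array's values into them.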
But are these two approaches any good? Or is there something smarter?
Thanks for your help!!
Upvotes: 0
Views: 1900
Reputation: 9057
To answer your question, we really need more information on what you are trying to achieve.
You wrote: "I'm trying to develop a sort of very simple machine learning example to recognize similarity between arrays. For this reason I'm trying to calculate the average between 2 arrays with different length."
Depending on your use case, similarity might be defined completely differently.
For instance:
General advice:
But in general, have a look at smoothing algorithms, for instance Kneser-Ney or Good-Turing smoothing. They explicitly deal with comparing vectors of probabilities that may differ in length (in other words, that have explicit zero entries).
https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation
Upvotes: 2
Reputation: 8292
If, after taking the average of the arrays, you intend to take the absolute value of the difference between each array and the average array, and to measure dissimilarity by the magnitude of that difference, then you are probably heading in the right direction.
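As a rough illustration of that idea, here is a sketch for the equal-length case from the question. Summing the absolute differences is only one possible reading of "magnitude of the difference", and the function name `dissimilarity` is hypothetical.

```python
def dissimilarity(array, average_array):
    """Sum of absolute element-wise differences (an assumed choice of magnitude)."""
    if len(array) != len(average_array):
        raise ValueError("this sketch only handles equal-length arrays")
    return sum(abs(x - m) for x, m in zip(array, average_array))


array_1 = [0, 4, 5]
array_2 = [4, 2, 7]
average_array = [2, 3, 6]

print(dissimilarity(array_1, average_array))  # 4
print(dissimilarity(array_2, average_array))  # 4
```

A smaller value means the array sits closer to the average array.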
But for arrays of different lengths, I propose that you also take the index of the extra elements into consideration.
For
array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];
the average should be average_array = [2, 3, 6, 6.5, 5.5];
where
6.5 = (10 + 3 (index) + 0 (missing element)) / 2
and
5.5 = (7 + 4 (index) + 0 (missing element)) / 2
The reason for taking the index into consideration is that the difference in length is also accounted for this way. However, this is just my 2 cents. Maybe there are better algorithms out there.
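The index-based rule above could be sketched like this. It assumes zero-based indices (as in the 6.5 and 5.5 calculations) and Python as the language. The name `average_with_index` is made up for the example.

```python
def average_with_index(a, b):
    """Element-wise average; extra positions use (value + index + 0) / 2."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    result = []
    for i, value in enumerate(longer):
        if i < len(shorter):
            # Both arrays have an element here: plain average.
            result.append((value + shorter[i]) / 2)
        else:
            # Missing element counts as 0, and the zero-based index is added.
            result.append((value + i + 0) / 2)
    return result


array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]
print(average_with_index(array_1, array_2))  # [2.0, 3.0, 6.0, 6.5, 5.5]
```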
You should also take a look at this post
Upvotes: 0