Jacob

Reputation: 659

Average between arrays of different length

I'm trying to develop a sort of very simple machine learning example to recognize similarity between arrays. For this reason I'm trying to calculate the average between 2 arrays with different length.

For example if I have:

array_1 = [0, 4, 5];
array_2 = [4, 2, 7];

The average is:

average_array = [2, 3, 6];

But how can I manage to calculate the average if I have the following situation:

array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];

As you can see, the arrays have different lengths. Is there an algorithm that I can apply to solve this problem? Does anyone have an idea or a suggestion?

Of course I can consider the missing values of the second array as 0, and evaluate the average as, for example:

average_array = [2, 3, 6, 5, 3.5];

or consider the values as "null" and have:

average_array = [2, 3, 6, 10, 7];
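
A minimal Python sketch of these two options, just to make them concrete. The function names are made up for illustration, and `zip_longest` from the standard library is used to pad the shorter array:

```python
from itertools import zip_longest

def average_zero_fill(a, b):
    """Element-wise average, treating missing entries as 0."""
    return [(x + y) / 2 for x, y in zip_longest(a, b, fillvalue=0)]

def average_ignore_missing(a, b):
    """Element-wise average, skipping entries that are missing in one array."""
    result = []
    for x, y in zip_longest(a, b, fillvalue=None):
        # Keep only the values that actually exist at this index.
        present = [v for v in (x, y) if v is not None]
        result.append(sum(present) / len(present))
    return result

array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]
print(average_zero_fill(array_1, array_2))       # [2.0, 3.0, 6.0, 5.0, 3.5]
print(average_ignore_missing(array_1, array_2))  # [2.0, 3.0, 6.0, 10.0, 7.0]
```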

But are these two approaches any good? Or is there something smarter?

Thanks for your help!!

Upvotes: 0

Views: 1900

Answers (2)

Joris Schellekens

Reputation: 9057

To answer your question, we really need more information on what you are trying to achieve.

I'm trying to develop a sort of very simple machine learning example to recognize similarity between arrays. For this reason I'm trying to calculate the average between 2 arrays with different length.

Depending on your use case, similarity might be defined completely differently.

For instance:

  • if the array encodes sound-information you might want to measure similarity as "does this sound clip occur in this one" or "are the main frequencies (which would correspond to chords) the same"
  • if the array encodes image information (properly DFT-ed and zig-zag-encoded) you might not care about the high frequencies (end of the array) and only measure the difference between the first few values of the array
  • if the array encodes some kind of composition of elements (e.g. this essay contains keyword "matrix" 40 times, and keyword "SVM" 27 times) the difference in values might be very important.
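
As a rough illustration of how the use case changes the measure, here is a Python sketch of the "first few values" idea from the second bullet. The name `prefix_distance` and the cutoff `k` are assumptions made for this example, not an established method:

```python
def prefix_distance(a, b, k):
    """Sum of absolute differences over the first k entries of both lists.

    Entries beyond index k (or beyond the shorter list) are ignored,
    so arrays of different lengths can still be compared.
    """
    n = min(k, len(a), len(b))
    return sum(abs(x - y) for x, y in zip(a[:n], b[:n]))

array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]
print(prefix_distance(array_1, array_2, 3))  # 8
```

With this definition the extra entries 10 and 7 simply do not count, which is only sensible if the tail of the array really is unimportant for your data.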

General advice:

  1. Think about what you're measuring
  2. Decide what's important

But in general, have a look at smoothing algorithms, for instance Kneser-Ney or Good-Turing smoothing. They explicitly deal with comparing vectors of probabilities that may differ in length (in other words, that have explicit zero entries).

https://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation

Upvotes: 2

Sumeet

Reputation: 8292

If, after taking the average of the arrays, you intend to take the absolute value of the difference between each array and the average array, then you are probably heading in the right direction, since you would be measuring dissimilarity by the magnitude of the difference.

But for arrays of different lengths I propose that you also take the index of the extra elements into consideration.

For

array_1 = [0, 4, 5, 10, 7];
array_2 = [4, 2, 7];

the average should be average_array = [2, 3, 6, 6.5, 5.5];

6.5 = (10 + 3(index) + 0(element) ) / 2

and

5.5 = (7 + 4(index) + 0(element))/2

The reason for taking the index into consideration is that the length difference is also accounted for with this approach. However, this is just my 2 cents. Maybe there are better algorithms out there.
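
A minimal Python sketch of the rule above, assuming 0-based indices as in the worked numbers (the function name `average_with_index` is just illustrative):

```python
def average_with_index(a, b):
    """Average two lists element-wise.

    Where only the longer list has a value at index i, the missing
    element counts as 0 and the index i is added: (value + i + 0) / 2.
    """
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    result = []
    for i, value in enumerate(longer):
        if i < len(shorter):
            result.append((value + shorter[i]) / 2)
        else:
            result.append((value + i + 0) / 2)
    return result

array_1 = [0, 4, 5, 10, 7]
array_2 = [4, 2, 7]
print(average_with_index(array_1, array_2))  # [2.0, 3.0, 6.0, 6.5, 5.5]
```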

You should also take a look at this post

Upvotes: 0
