Reputation: 13
Here's the example I'm working on:
import numpy as np

processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0)  # axis=0 reduces over rows, giving one min per column
max_per_col = np.amax(data, axis=0)  # likewise, one max per column
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])
data is defined as a 2d NumPy array. I am essentially trying to perform an operation on each element of data using the corresponding values in min_per_col and max_per_col.
I can't seem to figure out the right approach. From these posts, it seems the answer is to reshape the arrays so that broadcasting works.
Intuitively, I think broadcasting would work like this:
# Results of min_per_col:
# [min1 min2 min3 min4 min5]
# Transformation to (call this 2d_min_per_col):
# [[min1 min2 min3 min4 min5],
# [min1 min2 min3 min4 min5],
# [min1 min2 min3 min4 min5]
# ...
# [min1 min2 min3 min4 min5]]
# which basically duplicates min_per_col into a 2d array form.
# Do the same for max (2d_max_per_col)
# processed_data = (data - 2d_min_per_col) / (2d_max_per_col - 2d_min_per_col)
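To make that concrete, here's a small self-contained sketch of the duplication I have in mind, using np.tile on a made-up 4x3 array (the numbers are arbitrary):

import numpy as np

data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0],
                 [4.0, 40.0, 400.0]])

min_per_col = np.amin(data, axis=0)                   # shape (3,)
tiled_min = np.tile(min_per_col, (data.shape[0], 1))  # shape (4, 3)
print(tiled_min)
# [[  1.  10. 100.]
#  [  1.  10. 100.]
#  [  1.  10. 100.]
#  [  1.  10. 100.]]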
Does this approach make sense? Or is there another answer for how to approach something like this?
Please let me know if there's anything else that would be helpful to include for this post! Thank you.
EDIT: Thanks for the help, Mad Physicist! After trying this:
processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0)  # one min per column
max_per_col = np.amax(data, axis=0)  # one max per column

# version 1: explicit loops
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])
print("version 1\n", processed_data)

# version 2: broadcasting
processed_data = (data - min_per_col) / (max_per_col - min_per_col)
print("version 2\n", processed_data)
return processed_data
The two versions produce identical results, and the broadcast version is much faster!
version 1
[[0.25333333 0.13793103 0.14285714]
[0.32 0.79310345 0.92857143]
[0.13333333 0.48275862 0.51785714]
...
[0.28 0.4137931 0.125 ]
[0.01333333 0.24137931 0.75 ]
[0.08 0.20689655 0.23214286]]
version 2
[[0.25333333 0.13793103 0.14285714]
[0.32 0.79310345 0.92857143]
[0.13333333 0.48275862 0.51785714]
...
[0.28 0.4137931 0.125 ]
[0.01333333 0.24137931 0.75 ]
[0.08 0.20689655 0.23214286]]
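For anyone curious about the speed difference, here's a rough way to measure it with timeit (a sketch using made-up random data; exact numbers will vary):

import timeit
import numpy as np

rng = np.random.default_rng(42)
data = rng.random((1000, 3))
min_per_col = np.amin(data, axis=0)
max_per_col = np.amax(data, axis=0)

def loop_version():
    out = np.empty_like(data)
    for row_idx, row in enumerate(data):
        for col_idx, val in enumerate(row):
            out[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])
    return out

def broadcast_version():
    return (data - min_per_col) / (max_per_col - min_per_col)

print("loop:     ", timeit.timeit(loop_version, number=10))
print("broadcast:", timeit.timeit(broadcast_version, number=10))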
Thanks for the fast help :D
Upvotes: 1
Views: 157
Reputation: 114440
You have the gist of it, but the whole point of broadcasting is that you don't need to expand arrays to do operations on them: the shapes are lined up on the right. So, for example, if data.shape is (M, N), your array shapes look like this to the math operations:
data:           (M, N)
processed_data: (M, N)
min_per_col:    (N,)
max_per_col:    (N,)
Notice that min_per_col and max_per_col line up perfectly, as they should. That means that your entire loop becomes simply:
processed_data = (data - min_per_col) / (max_per_col - min_per_col)
#                    (M, N)                         (N,)
#                                   (M, N)
The comments under each operator show the shape of the broadcasted output.
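To see it end to end, here's a quick self-contained check on a small random array (the array here is just for the demo) that the broadcast one-liner matches the loop:

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((4, 3))            # M=4, N=3

min_per_col = np.amin(data, axis=0)  # shape (3,)
max_per_col = np.amax(data, axis=0)  # shape (3,)

# Broadcast version: (4, 3) op (3,) -> (4, 3)
broadcast = (data - min_per_col) / (max_per_col - min_per_col)

# Loop version for comparison
loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        loop[i, j] = (data[i, j] - min_per_col[j]) / (max_per_col[j] - min_per_col[j])

print(np.allclose(broadcast, loop))  # True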
As an aside, you can compute the denominator in a single step using np.ptp:
processed_data = (data - np.min(data, axis=0)) / np.ptp(data, axis=0)
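np.ptp ("peak to peak") is just the maximum minus the minimum along the given axis, e.g.:

import numpy as np

data = np.array([[1.0, 5.0],
                 [4.0, 9.0]])
print(np.ptp(data, axis=0))                           # [3. 4.]
print(np.amax(data, axis=0) - np.amin(data, axis=0))  # [3. 4.]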
Upvotes: 1