Griffin Beels

Reputation: 13

Replace double for loop (with indexing into other arrays) over numpy array

Here's the example I'm working on:

processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0)  # axis=0 -> per column, axis=1 -> per row
max_per_col = np.amax(data, axis=0)  # axis=0 -> per column, axis=1 -> per row
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])

data is a 2D numpy array. I'm essentially trying to apply an operation to each element of data using the corresponding values in min_per_col and max_per_col.

I can't seem to figure out the right approach. From similar posts, it seems like the answer is to reshape the arrays so that broadcasting works.

Intuitively, I think the way it would work with broadcasting would be:

# Results of min_per_col: 
#     [min1 min2 min3 min4 min5]

# Transformation to (call this 2d_min_per_col):
#     [[min1 min2 min3 min4 min5],
#      [min1 min2 min3 min4 min5],
#      [min1 min2 min3 min4 min5]
#      ...
#      [min1 min2 min3 min4 min5]]
# which basically duplicates min_per_col into a 2d array form.

# Do the same for max (2d_max_per_col)

# processed_data = (data - 2d_min_per_col) / (2d_max_per_col - 2d_min_per_col)
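To make the intuition concrete, I imagine the explicit duplication would look something like this (a small made-up 4×5 array, using np.tile to build the 2D copies):

```python
import numpy as np

# Made-up sample data: 4 rows x 5 columns.
data = np.arange(20, dtype=float).reshape(4, 5)

min_per_col = np.amin(data, axis=0)  # shape (5,)
max_per_col = np.amax(data, axis=0)  # shape (5,)

# Explicitly duplicate the per-column mins/maxes, one copy per row.
min_2d = np.tile(min_per_col, (data.shape[0], 1))  # shape (4, 5)
max_2d = np.tile(max_per_col, (data.shape[0], 1))  # shape (4, 5)

processed = (data - min_2d) / (max_2d - min_2d)
```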

Does this approach make sense? Or is there another answer for how to approach something like this?

Please let me know if there's anything else that would be helpful to include for this post! Thank you.

EDIT: Thanks for the help, Mad Physicist! After trying this:

processed_data = np.empty_like(data)
min_per_col = np.amin(data, axis=0) # axis0 for col, axis1 for row
max_per_col = np.amax(data, axis=0) # axis0 for col, axis1 for row
for row_idx, row in enumerate(data):
    for col_idx, val in enumerate(row):
        processed_data[row_idx, col_idx] = (val - min_per_col[col_idx]) / (max_per_col[col_idx] - min_per_col[col_idx])
print("version 1\n", processed_data)

processed_data = (data - min_per_col) / (max_per_col - min_per_col)
print("version 2\n", processed_data)

return processed_data

It works identically, and is much faster!

version 1
 [[0.25333333 0.13793103 0.14285714]
 [0.32       0.79310345 0.92857143]
 [0.13333333 0.48275862 0.51785714]
 ...
 [0.28       0.4137931  0.125     ]
 [0.01333333 0.24137931 0.75      ]
 [0.08       0.20689655 0.23214286]]
version 2
 [[0.25333333 0.13793103 0.14285714]
 [0.32       0.79310345 0.92857143]
 [0.13333333 0.48275862 0.51785714]
 ...
 [0.28       0.4137931  0.125     ]
 [0.01333333 0.24137931 0.75      ]
 [0.08       0.20689655 0.23214286]]

Thanks for the fast help :D

Upvotes: 1

Views: 157

Answers (1)

Mad Physicist

Reputation: 114440

You have the gist of it, but the whole point of broadcasting is that you don't need to expand arrays to operate on them: the shapes are lined up from the right. So, for example, if data.shape is (M, N), your array shapes look like this to the math operations:

data:           (M, N)
processed_data: (M, N)
min_per_col:       (N,)
max_per_col:       (N,)

Notice that min_per_col and max_per_col line up perfectly, as they should. That means your entire loop becomes simply:

processed_data = (data - min_per_col) / (max_per_col - min_per_col)
#                    (M, N)                         (N,)
#                                   (M, N)

The comments under each operator show the shape of the broadcasted output.
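As a concrete check (with a made-up 4×3 array), the double loop and the broadcast expression agree:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((4, 3))  # M=4 rows, N=3 columns

min_per_col = np.amin(data, axis=0)  # shape (3,)
max_per_col = np.amax(data, axis=0)  # shape (3,)

# Loop version, element by element.
looped = np.empty_like(data)
for i, row in enumerate(data):
    for j, val in enumerate(row):
        looped[i, j] = (val - min_per_col[j]) / (max_per_col[j] - min_per_col[j])

# Broadcast version: (4, 3) op (3,) -> (4, 3), no duplication needed.
broadcast = (data - min_per_col) / (max_per_col - min_per_col)

assert np.allclose(looped, broadcast)
```

After this min-max scaling, every column spans exactly [0, 1].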

As an aside, you can compute the denominator in a single step using np.ptp:

processed_data = (data - np.min(data, axis=0)) / np.ptp(data, axis=0)
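For example (with a made-up 3×2 array), np.ptp ("peak to peak") along an axis is just the max minus the min along that axis:

```python
import numpy as np

data = np.array([[1.0, 10.0],
                 [3.0, 20.0],
                 [2.0, 40.0]])

# np.ptp(data, axis=0) == np.amax(data, axis=0) - np.amin(data, axis=0)
assert np.allclose(np.ptp(data, axis=0),
                   np.amax(data, axis=0) - np.amin(data, axis=0))

processed = (data - np.min(data, axis=0)) / np.ptp(data, axis=0)
```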

Upvotes: 1
