Ahmad Odeh

Reputation: 148

In Python with NumPy, how can I update an array from another array based on a column that exists in both?

So I have a source array like this:

 [[  9  85  32 100]
 [  7  80  30 100]
 [  2  90  16 100]
 [  6 120  22 100]
 [  5 105  17 100]
 [  0 100  33 100]
 [  3 110  22 100]
 [  4  80  22 100]
 [  8 115  19 100]
 [  1  95  28 100]]

and I want to update it with this one, matching rows on the first column:

[[  3 110  22 105]
 [  5 105  17 110]
 [  1  95  28 115]]

so that the result looks like this:

 [[  9  85  32 100]
 [  7  80  30 100]
 [  2  90  16 100]
 [  6 120  22 100]
 [  5 105  17 110]
 [  0 100  33 100]
 [  3 110  22 105]
 [  4  80  22 100]
 [  8 115  19 100]
 [  1  95  28 115]]

But I can't find a function in NumPy that does this directly, so for now I have nothing better than this function I wrote:

import numpy as np

def update_ary_with_ary(source, updates):
    # For each update row, find the source row(s) with the same first-column value
    for x in updates:
        index_of_col = np.argwhere(source[:,0] == x[0])
        source[index_of_col] = x

This function loops in Python, so it is neither elegant nor fast. I will use it until someone gives me a better way using NumPy functions. I don't want a solution from another library, just NumPy.

Upvotes: 0

Views: 153

Answers (2)

fountainhead

Reputation: 3722

Assuming your source array is s and update array is u, and assuming that s and u are not huge, you can do:

update_row_ids = np.nonzero(s[:,0] == u[:,0].reshape(-1,1))[1]
s[update_row_ids] = u

Testing:

import numpy as np
s = np.array(
    [[  9,  85,  32, 100],
     [  7,  80,  30, 100],
     [  2,  90,  16, 100],
     [  6, 120,  22, 100],
     [  5, 105,  17, 100],
     [  0, 100,  33, 100],
     [  3, 110,  22, 100],
     [  4,  80,  22, 100],
     [  8, 115,  19, 100],
     [  1,  95,  28, 100]])
u = np.array(
    [[  3, 110,  22, 105],
     [  5, 105,  17, 110],
     [  1,  95,  28, 115]])

update_row_ids = np.nonzero(s[:,0] == u[:,0].reshape(-1,1))[1]
s[update_row_ids] = u

print(s)

This prints:

[[  9  85  32 100]
 [  7  80  30 100]
 [  2  90  16 100]
 [  6 120  22 100]
 [  5 105  17 110]
 [  0 100  33 100]
 [  3 110  22 105]
 [  4  80  22 100]
 [  8 115  19 100]
 [  1  95  28 115]]
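
The "not huge" condition matters because the comparison builds a boolean matrix with one row per update row and one column per source row, so its size grows as len(u) * len(s). A minimal sketch of the intermediate values, using small made-up arrays rather than the ones from the question:

```python
import numpy as np

# Small illustrative arrays (hypothetical data, not from the question)
s = np.array([[9, 85], [7, 80], [2, 90], [5, 105]])
u = np.array([[2, 91], [9, 86]])

# Broadcasting shape (len(s),) against (len(u), 1) gives a (len(u), len(s)) boolean matrix
matches = s[:, 0] == u[:, 0].reshape(-1, 1)
print(matches.shape)  # (2, 4)

# np.nonzero scans row by row, so the column indices come out in the order of u's rows
update_row_ids = np.nonzero(matches)[1]
print(update_row_ids)  # [2 0]

# Each selected source row is overwritten by the update row with the same key
s[update_row_ids] = u
print(s)
```

The row-by-row ordering is what keeps update_row_ids aligned with u, and it relies on every row of u matching exactly one row of s.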

Edit: OP has provided the following additional details:

  • The "source array" is "huge".
  • Each row in the "update array" matches exactly one row in the "source array".

Based on this additional detail, the following alternative solution might perform better, especially if the rows of the source array are not sorted on the first column:

sorted_idx = np.argsort(s[:,0])
pos = np.searchsorted(s[:,0],u[:,0],sorter=sorted_idx)
update_row_ids = sorted_idx[pos]

s[update_row_ids] = u
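
The searchsorted lookup returns an insertion position even for a key that is not present in s, so it may be worth verifying the matches before writing. A minimal sketch under that assumption; update_rows_by_key is a hypothetical helper name, the data is made up, and the keys in s[:, 0] are assumed to be unique:

```python
import numpy as np

def update_rows_by_key(s, u):
    """Overwrite rows of s with the rows of u that share the same first-column value.

    Hypothetical helper; assumes s is non-empty and the keys in s[:, 0] are unique.
    """
    sorted_idx = np.argsort(s[:, 0])
    pos = np.searchsorted(s[:, 0], u[:, 0], sorter=sorted_idx)
    # Keys larger than every key in s give pos == len(s); clip so indexing stays valid
    pos = np.clip(pos, 0, len(s) - 1)
    update_row_ids = sorted_idx[pos]
    # Confirm that each looked-up row really carries the requested key
    if not np.array_equal(s[update_row_ids, 0], u[:, 0]):
        raise KeyError("some keys in u are missing from s")
    s[update_row_ids] = u
    return s

# Illustrative data (not from the question)
s = np.array([[9, 85, 100], [7, 80, 100], [2, 90, 100], [5, 105, 100]])
u = np.array([[5, 105, 110], [9, 85, 120]])
update_rows_by_key(s, u)
print(s)
```

The sort costs O(n log n) once and each lookup is a binary search, so no len(u) by len(s) matrix is ever built.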

Upvotes: 1

Ahmad Odeh

Reputation: 148

fountainhead, your answer works correctly, and yes, it uses only NumPy. But in my performance test it doubled the time to process 50K rows in my simulation program, from 22 seconds to 44 seconds! I don't know why. Still, your answer helped me find the right answer, which is just this line:

source[updates[:,0]] = updates
# or 
s[u[:,0]] = u

When I use this, the processing time for 100K rows drops to only 0.5 seconds, which lets me process something like 1M rows in only 5 seconds. I am still learning Python and data mining, and I am shocked by these numbers. This never happened to me in other languages; here I can work on a huge array as if it were a regular variable. You can see that on my GitHub:

https://github.com/qahmad81/war_simulation
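
A caveat worth checking before relying on s[u[:,0]] = u: it uses the first-column values directly as row positions, so it appears to be correct only when row i of the source holds ID i. That is not the case for the example array in the question, whose first row has ID 9. A small sketch with made-up data (the names ordered and shuffled are illustrative only):

```python
import numpy as np

# Hypothetical data: the first column holds IDs 0..3
ordered = np.array([[0, 100], [1, 100], [2, 100], [3, 100]])  # row i has ID i
shuffled = ordered[[3, 1, 0, 2]].copy()                       # IDs no longer match row positions
u = np.array([[2, 105], [0, 110]])

ordered[u[:, 0]] = u   # rows 2 and 0 also hold IDs 2 and 0, so this is correct
shuffled[u[:, 0]] = u  # overwrites rows 2 and 0, which held IDs 0 and 3
print(ordered)
print(shuffled)
```

In the shuffled case ID 3 disappears and ID 2 ends up in two rows, so the one-liner is only a safe shortcut when the source is kept in ID order.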

fountainhead, you should get the accepted answer, but visitors should know the best answer to use.

Upvotes: 0
