Prajjwal Das
Prajjwal Das

Reputation: 13

If I append an element from one 2D numpy array to a list and then modify the original element, why does the list change as well?

I'm trying to Tim sort an hdf5 file. I converted the data in the hdf5 file into a 2D numpy array, and I'm trying to sort it through a stable algorithm. The insertion step goes well.

However, when I checked by debugging, in the merge step, when the left or right array (arrays holding the pre-sorted values) elements are assigned to the original array one by one, the left and right arrays are also changed, which messes up the rest of the sorting as only one element is repeated throughout that entire chunk.

This is what I'm working with:

from collections import deque
import h5py
import sys

filename = sys.argv[1]

with h5py.File(filename, "r") as f:
    a_group_key = list(f.keys())[0]
    ds_arr = f[a_group_key][()]

MINIMUM= 32
  
def find_minrun(n): 
  
    r = 0
    while n >= MINIMUM: 
        r |= n & 1
        n >>= 1
    return n + r 
  
def insertion_sort(array, left, right): 
    for i in range(left+1,right+1):
        #element = array[i]
        j = i
        while array[j-1][11]>array[j][11] and j>left :
            array[[j, j-1]] = array[[j-1, j]]
            j -= 1
        #array[j+1] = element
    return array
              
def merge(array, l, m, r): 
  
    array_length1= m - l + 1
    array_length2 = r - m 
    left = []
    right = []
    for i in range(0, array_length1): 
        left.append(array[l + i]) 
    for i in range(0, array_length2): 
        right.append(array[m + 1 + i]) 
  
    i=0
    j=0
    k=l
   
    while j < array_length2 and  i < array_length1: 
        if left[i][11] <= right[j][11]:
            array[k] = left[i]
            i += 1
  
        else: 
            array[k] = right[j]
            j += 1
  
        k += 1
  
    while i < array_length1: 
        array[k] = left[i] 
        k += 1
        i += 1
  
    while j < array_length2: 
        array[k] = right[j] 
        k += 1
        j += 1
  
def tim_sort(array): 
    n = len(array) 
    minrun = find_minrun(n) 
  
    for start in range(0, n, minrun): 
        end = min(start + minrun - 1, n - 1) 
        insertion_sort(array, start, end) 
   
    size = minrun 
    while size < n: 
  
        for left in range(0, n, 2 * size): 
  
            mid = min(n - 1, left + size - 1) 
            right = min((left + 2 * size - 1), (n - 1)) 
            merge(array, left, mid, right) 
  
        size = 2 * size 

tim_sort(ds_arr)

Here is what's happening:

E.g., for chunk size = 38 (calculated during run)

ds_arr:
[(0, 127.062386, 233.88705, 2043.5269, 0.9048116, 0.91757613, 40.91537, 0.03138579, 0.03187117, 0.01391112, 7452.246, 16),
(0, 185.04272, 327.6841, 673.2232, 0.94436234, 0.8363992, 34.514076, 0.0662952, 0.05692514, 0.11432385, 2303.1208, 24),
(0, 215.6241, 213.80653, 1756.7979, 0.89432126, 0.9836176, 36.561146, 0.03355443, 0.03731155, 0.0907836, 6127.373, 123),
... (+16 elements),
(0, 244.40192, 115.47513, 1171.4187, 0.9789816, 0.9536665, 26.107262, 0.04582449, 0.04447512, 0.02585861, 3681.2678, 6),
(0, 252.34537, 376.03607, 1285.4608, 0.93303615, 0.9969181, 34.071854, 0.0423173, 0.04572778, 0.06407942, 4192.847, 18),
(0, 343.2354, 207.10603, 859.55945, 0.93482053, 0.9706468, 35.455666, 0.05554156, 0.05821768, 0.03690968, 2773.9412, 34),
... (+16 elements),
other chunks]

left:
[(0, 127.062386, 233.88705, 2043.5269, 0.9048116, 0.91757613, 40.91537, 0.03138579, 0.03187117, 0.01391112, 7452.246, 16),
(0, 185.04272, 327.6841, 673.2232, 0.94436234, 0.8363992, 34.514076, 0.0662952, 0.05692514, 0.11432385, 2303.1208, 24),
(0, 215.6241, 213.80653, 1756.7979, 0.89432126, 0.9836176, 36.561146, 0.03355443, 0.03731155, 0.0907836, 6127.373, 123),
... (+16 elements)]

right:
[(0, 244.40192, 115.47513, 1171.4187, 0.9789816, 0.9536665, 26.107262, 0.04582449, 0.04447512, 0.02585861, 3681.2678, 6),
(0, 252.34537, 376.03607, 1285.4608, 0.93303615, 0.9969181, 34.071854, 0.0423173, 0.04572778, 0.06407942, 4192.847, 18),
(0, 343.2354, 207.10603, 859.55945, 0.93482053, 0.9706468, 35.455666, 0.05554156, 0.05821768, 0.03690968, 2773.9412, 34),
... (+16 elements)]

sorted ds_arr:
[(0, 244.40192, 115.47513, 1171.4187, 0.9789816, 0.9536665, 26.107262, 0.04582449, 0.04447512, 0.02585861, 3681.2678, 6),
(0, 244.40192, 115.47513, 1171.4187, 0.9789816, 0.9536665, 26.107262, 0.04582449, 0.04447512, 0.02585861, 3681.2678, 6),
(0, 244.40192, 115.47513, 1171.4187, 0.9789816, 0.9536665, 26.107262, 0.04582449, 0.04447512, 0.02585861, 3681.2678, 6),
... (+35 times),
similarly other chunks]

instead of getting a sorted list.

Upvotes: 0

Views: 46

Answers (1)

wwii
wwii

Reputation: 23773

You can sort an array based on one of its columns by using argsort and integer array indexing.

>>> a
array([[  7,   1,   1,  -3],
       [ -2, -14,  -8,   5],
       [  1,   7,  14,  10],
       [ 10,  -1,   3,   7],
       [-15,  -3, -13,   0],
       [ -9, -14,  12, -12]], dtype=int64)
>>> a[:,1].argsort()
array([1, 5, 4, 3, 0, 2], dtype=int64)
>>> b = a[:,1].argsort()
>>> a[b]
array([[ -2, -14,  -8,   5],
       [ -9, -14,  12, -12],
       [-15,  -3, -13,   0],
       [ 10,  -1,   3,   7],
       [  7,   1,   1,  -3],
       [  1,   7,  14,  10]], dtype=int64)
>>>

I did not spend time trying to sort out your sorting attempt but I suspect that you need to append copies to your list instead of views.

>>> a
array([[ -2,  -8,   3,  -8],
       [ 12,   2, -12, -14],
       [  0,  -7,   3,   7],
       [ -9,  12, -12,  10],
       [ 14,  -4, -12,  11],
       [  7,  10,  14,   9]], dtype=int64)

When you append a slice of an array to a list you are creating a reference to a view of that slice. Changing an element in the original array will be reflected in the view.

>>> q = []
>>> a[0]  
array([-2, -8,  3, -8], dtype=int64)
>>> q.append(a[0])
>>> q
[array([-2, -8,  3, -8], dtype=int64)]
>>> a[0,1] = 99
>>> a
array([[ -2,  99,   3,  -8],
       [ 12,   2, -12, -14],
       [  0,  -7,   3,   7],
       [ -9,  12, -12,  10],
       [ 14,  -4, -12,  11],
       [  7,  10,  14,   9]], dtype=int64)
>>> q
[array([-2, 99,  3, -8], dtype=int64)]

When you make a copy of an array slice changes in the original array are NOT reflected in the copy.

>>> q.append(a[1].copy())
>>> q
[array([-2, 99,  3, -8], dtype=int64), array([ 12,   2, -12, -14], dtype=int64)]
>>> a[1,1] = 98
>>> a
array([[ -2,  99,   3,  -8],
       [ 12,  98, -12, -14],
       [  0,  -7,   3,   7],
       [ -9,  12, -12,  10],
       [ 14,  -4, -12,  11],
       [  7,  10,  14,   9]], dtype=int64)
>>> q
[array([-2, 99,  3, -8], dtype=int64), array([ 12,   2, -12, -14], dtype=int64)]
>>>            

How can I tell if NumPy creates a view or a copy?

Upvotes: 0

Related Questions