abbudeh

Reputation: 65

How to sort complex structured data with NumPy?

I have a file whose lines consist of 2 integers and one float. I read the file with numpy:

import numpy as np

# 'pre' holds two int32 values per record, 'data' a single float64
dt = np.dtype([('pre', np.dtype('i4'), 2), ('data', np.float64)])
a = np.fromfile("myfile", dtype=dt)

array([([65536, 65536], 0.2       ), ([65536,     1], 1.33566434),
       ([65536,     2], 2.06068931), ..., ([65535,   479], 0.33333333),
       ([65535,  2295], 0.09090909), ([65535,   249], 0.07692308)],
      dtype=[('pre', '<i4', (2,)), ('data', '<f8')])

I actually have two questions.

First: when I iterate over a with np.nditer, I can't access a[0][0][0], for example. Why is that, and how should np.nditer be used?

Second: how can I sort the elements by the first entry of the ['pre'] list, and then by the second entry of ['pre']? The wanted output would look like:

array([([1, 1], 0.2       ), ([1,     2], 1.33566434),
       ([1,     3], 2.06068931), ..., ([2,   1], 0.33333333),
       ([2,  2], 0.09090909), ([2,   3], 0.07692308)],
      dtype=[('pre', '<i4', (2,)), ('data', '<f8')])

Any suggestions are welcome, even if changing the data type used to read the file would help. Performance matters as well, because the file is very large. Thanks.

Upvotes: 0

Views: 61

Answers (1)

hpaulj

Reputation: 231395

You have a 1d structured array:

In [56]: arr = np.array([([65536, 65536], 0.2       ), ([65536,     1], 1.33566434),
    ...:        ([65536,     2], 2.06068931), ([65535,   479], 0.33333333),
    ...:        ([65535,  2295], 0.09090909), ([65535,   249], 0.07692308)],
    ...:       dtype=[('pre', '<i4', (2,)), ('data', '<f8')])
    ...:       
In [57]: arr
Out[57]: 
array([([65536, 65536], 0.2       ), ([65536,     1], 1.33566434),
       ([65536,     2], 2.06068931), ([65535,   479], 0.33333333),
       ([65535,  2295], 0.09090909), ([65535,   249], 0.07692308)],
      dtype=[('pre', '<i4', (2,)), ('data', '<f8')])
In [58]: arr.shape
Out[58]: (6,)
In [59]: arr.dtype
Out[59]: dtype([('pre', '<i4', (2,)), ('data', '<f8')])
In [60]: arr['pre']
Out[60]: 
array([[65536, 65536],
       [65536,     1],
       [65536,     2],
       [65535,   479],
       [65535,  2295],
       [65535,   249]], dtype=int32)
In [61]: arr['data']
Out[61]: 
array([0.2       , 1.33566434, 2.06068931, 0.33333333, 0.09090909,
       0.07692308])

It has 2 fields. The 'pre' field has 2 elements per record, so arr['pre'] is a 2d numeric array of shape (6, 2).

As a general rule you don't need to use nditer to iterate through an array. It's useful when developing Cython code, but isn't needed in Python code.

If you use nditer, each step yields a 0-d array (shape ()) with the original dtype:

In [70]: for x in np.nditer(arr):
    ...:     print(x)

([65536, 65536], 0.2)
([65536,     1], 1.33566434)
([65536,     2], 2.06068931)
([65535,   479], 0.33333333)
([65535,  2295], 0.09090909)
([65535,   249], 0.07692308)

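The difference between that and direct iteration is subtle. The type in the nditer case is <class 'numpy.ndarray'>; in the direct iteration case it is <class 'numpy.void'>. A 0-d array cannot be indexed with an integer, which is why x[0] fails inside the nditer loop; index by field name instead, for example x['pre'][0].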
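To make the contrast concrete, here is a minimal sketch comparing the two iteration styles with plain field indexing. It reuses the dtype from the question on a three-record sample; the variable names (direct_first, nditer_first, vectorized_first, int_index_ok) are illustrative only.

```python
import numpy as np

dt = np.dtype([('pre', '<i4', (2,)), ('data', '<f8')])
arr = np.array([([65536, 65536], 0.2), ([65536, 1], 1.33566434),
                ([65535, 479], 0.33333333)], dtype=dt)

# Direct iteration yields numpy.void records, which accept positional indexing:
# rec[0] is the 'pre' field, rec[0][0] its first element.
direct_first = [int(rec[0][0]) for rec in arr]

# nditer yields 0-d arrays: an integer index raises IndexError,
# but indexing by field name works.
nditer_first = []
int_index_ok = True
for x in np.nditer(arr):
    try:
        x[0]
    except IndexError:
        int_index_ok = False
    nditer_first.append(int(x['pre'][0]))

# Usually the simplest option: index the field for the whole array at once.
vectorized_first = arr['pre'][:, 0].tolist()

print(direct_first, nditer_first, vectorized_first, int_index_ok)
```

For a large file the last form is likely the one to prefer, since it avoids a Python-level loop entirely.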

As for the sorting, it sounds like you want np.lexsort using the 2 columns of the 'pre' field:

In [76]: np.lexsort((arr['pre'][:,1], arr['pre'][:,0]))
Out[76]: array([5, 3, 4, 1, 2, 0])
In [77]: arr[_]
Out[77]: 
array([([65535,   249], 0.07692308), ([65535,   479], 0.33333333),
       ([65535,  2295], 0.09090909), ([65536,     1], 1.33566434),
       ([65536,     2], 2.06068931), ([65536, 65536], 0.2       )],
      dtype=[('pre', '<i4', (2,)), ('data', '<f8')])
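The same steps as a standalone script, in case the IPython `_` shortcut is unfamiliar. This is only a sketch using the six sample records above; the names pre, order and sorted_arr are illustrative. Note that np.lexsort treats the last key in the tuple as the primary sort key, which is why column 0 is passed last.

```python
import numpy as np

dt = np.dtype([('pre', '<i4', (2,)), ('data', '<f8')])
arr = np.array([([65536, 65536], 0.2), ([65536, 1], 1.33566434),
                ([65536, 2], 2.06068931), ([65535, 479], 0.33333333),
                ([65535, 2295], 0.09090909), ([65535, 249], 0.07692308)],
               dtype=dt)

pre = arr['pre']                      # 2d int32 array, shape (6, 2)

# The LAST key is the primary one: sort by column 0, break ties with column 1.
order = np.lexsort((pre[:, 1], pre[:, 0]))

# Fancy indexing with the permutation returns a sorted copy of the records.
sorted_arr = arr[order]

print(order)
print(sorted_arr)
```

Because arr[order] makes a copy, a very large array temporarily needs roughly twice its size in memory, plus the index array.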

A similar lexsort was just recommended in the question "numpy sort 2d: rearrange rows without changing values in row".
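Since the question allows changing the dtype, one alternative worth trying is to view the same bytes with two scalar integer fields and sort with the order argument. This is a sketch, not part of the lexsort approach above: the field names p0 and p1 are hypothetical, and the view relies on both dtypes having the same 16-byte item size.

```python
import numpy as np

nested_dt = np.dtype([('pre', '<i4', (2,)), ('data', '<f8')])
# Hypothetical flat layout: same bytes, but 'pre' split into two scalar fields.
flat_dt = np.dtype([('p0', '<i4'), ('p1', '<i4'), ('data', '<f8')])

arr = np.array([([65536, 65536], 0.2), ([65536, 1], 1.33566434),
                ([65536, 2], 2.06068931), ([65535, 479], 0.33333333),
                ([65535, 2295], 0.09090909), ([65535, 249], 0.07692308)],
               dtype=nested_dt)

flat = arr.view(flat_dt)                          # reinterpret, no copy
flat_sorted = np.sort(flat, order=['p0', 'p1'])   # sort by p0, then p1
sorted_arr = flat_sorted.view(nested_dt)          # back to the nested layout

print(sorted_arr)
```

Sorting with order compares whole records and may well be slower than lexsort on large inputs, so it is worth timing both on a realistic slice of the file before choosing.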

Upvotes: 1
