Pandas Group 2-D NumPy Data by Range of Values

Question

I have a large data set in the form of a 2D array. The array represents continuous intensity data, and I want to use it to create another 2D array of the same size in which the values are grouped into discrete values. In other words, if I have a 2D array like this,

[(11, 23, 33, 12),
 (21, 31, 13, 19),
 (33, 22, 26, 31)]

The output would be as shown below, with the values from 10 to 19 assigned to 1, 20 to 29 assigned to 2, and 30 to 39 assigned to 3.

[(1, 2, 3, 1),
 (2, 3, 1, 1),
 (3, 2, 2, 3)]

Ideally, I would like to make these assignments based on percentiles: the values that fall into the top 10 percent get assigned to 5, the values in the top 20 percent to 4, and so on.

My data set is a NumPy array. I have looked at groupby, but it does not seem to let me specify ranges. I have also looked at cut; however, cut only works on 1D arrays. I have considered running cut in a loop over each row of the data, but I am concerned that this may take too much time. My matrices could be as big as 4000 rows by 4000 columns.

harpan · Accepted Answer

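You need to stack the DataFrame to get a 1-D representation and then apply cut, passing right=False so that each bin includes its lower edge (20 goes to bin 2, as the question asks). After that you can unstack it.

[tuple(x) for x in (pd.cut(pd.DataFrame(a).stack(), bins=[10, 20, 30, 40], right=False, labels=False) + 1).unstack().values]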

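Or, using @user3483203's magic, with np.searchsorted (side='right' puts a value equal to a bin edge, such as 20, into the higher bin):

[tuple(x) for x in np.searchsorted([10, 20, 30, 40], np.array(a), side='right')]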

Output:

[(1, 2, 3, 1), 
 (2, 3, 1, 1), 
 (3, 2, 2, 3)]
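
For the percentile-based grouping mentioned in the question, a minimal sketch along the same lines is to compute the cut points with np.percentile over the whole array and feed them to np.searchsorted. The helper name bin_by_percentile and the default cut points (50th through 90th percentile, so the top 10 percent gets 5, the next 10 percent gets 4, and everything below the median gets 0) are assumptions for illustration; the question does not say how the lower half should be labeled, so adjust the percentiles to suit.

```python
import numpy as np


def bin_by_percentile(a, percentiles=(50, 60, 70, 80, 90)):
    """Label each element by how many percentile cut points it reaches.

    Hypothetical helper: the default cut points are an assumption, chosen
    so that the top 10 percent of values gets the label 5.
    """
    a = np.asarray(a)
    # Cut points are computed over the whole array, not per row or column.
    edges = np.percentile(a, percentiles)
    # side="right" puts a value equal to an edge into the higher bin.
    # The result has the same shape as `a`, with no Python-level loop.
    return np.searchsorted(edges, a, side="right")


a = np.array([(11, 23, 33, 12),
              (21, 31, 13, 19),
              (33, 22, 26, 31)])
labels = bin_by_percentile(a)
print(labels)
```

Because both np.percentile and np.searchsorted are vectorized, this should avoid the row-by-row loop the question was worried about, though it has not been timed here on a 4000 by 4000 array.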
