resunga

Reputation: 43

Insert value in numpy array with conditions

I want to insert values into a NumPy array as follows:

  1. If the Nth row is the same as the (N-1)th row, insert 1 for both the Nth and (N-1)th rows in the current column, and 0 elsewhere.
  2. If the Nth row is different from the (N-1)th row, move to the next column and repeat the condition.

Here is an example:
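import numpy as np
import pandas as pd

d = {'col1': [2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
     'col2': [3, 3, 4, 4, 4, 1, 1, 0, 0, 0]}
df = pd.DataFrame(data=d)
np.zeros((10, 4))  # matrix to fill: one row per df row, one column per run of equal rows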
###########################################################
OUTPUT MATRIX

         1     0     0    0   First two rows are the same, so 1,1 in the first column
         1     0     0    0
         0     1     0    0   Next three rows are the same, so 1,1,1 in the second column
         0     1     0    0
         0     1     0    0
         0     0     1    0   Again two rows are the same, so 1,1 in the third column
         0     0     1    0
         0     0     0    1   Again three rows are the same, so 1,1,1 in the fourth column
         0     0     0    1
         0     0     0    1

Upvotes: 2

Views: 121

Answers (1)

mozway

Reputation: 261850

IIUC, you can achieve this simply with numpy indexing:

# group by successive identical values
# (a new group starts where every column differs from the previous row)
group = df.ne(df.shift()).all(axis=1).cumsum().sub(1)

# craft the numpy array
a = np.zeros((len(group), group.max()+1), dtype=int)
a[np.arange(len(df)), group] = 1

print(a)
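
To make the indexing step concrete, here is a minimal sketch that uses a hard-coded label array (an assumption standing in for the `group` Series computed from the question's data). Passing two index arrays of equal length pairs them element-wise, so row `i` gets a 1 in column `group[i]`. Indexing an identity matrix by the labels gives the same result.

```python
import numpy as np

# Assumed labels: one group id per row, as the question's data would produce
group = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 3])

n_rows = len(group)
n_groups = group.max() + 1

a = np.zeros((n_rows, n_groups), dtype=int)
# Integer-array indexing pairs row i with column group[i]
a[np.arange(n_rows), group] = 1

# Equivalent: take row group[i] of the identity matrix for each i
b = np.eye(n_groups, dtype=int)[group]

print(a)
```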

Alternative with numpy.identity:

# group by successive identical values
# (a new group starts where every column differs from the previous row)
group = df.ne(df.shift()).all(axis=1).cumsum().sub(1)
shape = df.groupby(group).size()

# craft the numpy array
a = np.repeat(np.identity(len(shape), dtype=int), shape, axis=0)

print(a)
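
As a rough illustration of why repeating the identity works: `np.repeat` with `axis=0` and a list of counts repeats row `i` of the identity matrix `counts[i]` times, which stacks the one-hot rows in group order. The counts below are hard-coded (an assumption matching the run lengths in the question's data).

```python
import numpy as np

# Assumed run lengths of the successive groups: 2, 3, 2 and 3 rows
counts = [2, 3, 2, 3]

eye = np.identity(len(counts), dtype=int)
# Row i of the identity matrix is repeated counts[i] times
a = np.repeat(eye, counts, axis=0)

print(a)
```

This only gives the right answer when each group's rows are contiguous, since the repeated rows come out in group order.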

output:

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1],
       [0, 0, 0, 1]])

intermediates:

group
0    0
1    0
2    1
3    1
4    1
5    2
6    2
7    3
8    3
9    3
dtype: int64

shape
0    2
1    3
2    2
3    3
dtype: int64
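
A caveat on the grouping line used in both approaches above, as a sketch: reducing with `all` starts a new group only when every column changes from the previous row. If a row should count as different as soon as any single column changes (an assumption about the question's intent), reducing with `any` may be the safer variant. The frame `df2` below is made up for illustration; the two reductions agree on the question's data but not here.

```python
import pandas as pd

# Made-up frame: only one column changes at rows 2 and 3
df2 = pd.DataFrame({'col1': [2, 2, 2, 3],
                    'col2': [3, 3, 4, 4]})

changed = df2.ne(df2.shift())

# New group only when ALL columns change
group_all = changed.all(axis=1).cumsum().sub(1)
# New group when ANY column changes
group_any = changed.any(axis=1).cumsum().sub(1)

print(group_all.tolist())  # [0, 0, 0, 0]
print(group_any.tolist())  # [0, 0, 1, 2]
```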

Other option

For fun, likely not so efficient on large inputs:

# dtype=int keeps 0/1 integers (recent pandas versions return booleans by default)
a = pd.get_dummies(df.agg(tuple, axis=1), dtype=int).to_numpy()

Note that this last option groups all identical rows, not just successive identical rows. To get the same behavior with the first (numpy indexing) approach, use group = df.groupby(list(df)).ngroup(). This does not work with the repeated-identity approach, which assumes each group's rows are contiguous.
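
The difference between the two groupings only shows up when an earlier row value reappears later. The sketch below uses a made-up frame `df3` and a small helper `one_hot` (both hypothetical, introduced only for illustration) to compare the two label sets with the numpy indexing approach.

```python
import numpy as np
import pandas as pd

# Made-up frame: the row (2, 3) comes back after a different row
df3 = pd.DataFrame({'col1': [2, 2, 3, 2],
                    'col2': [3, 3, 4, 3]})

# Labels for runs of successive identical rows
successive = df3.ne(df3.shift()).all(axis=1).cumsum().sub(1)
# Labels for identical rows anywhere in the frame
identical = df3.groupby(list(df3)).ngroup()

def one_hot(labels):
    """Hypothetical helper: one row per label, 1 in the label's column."""
    labels = np.asarray(labels)
    out = np.zeros((len(labels), labels.max() + 1), dtype=int)
    out[np.arange(len(labels)), labels] = 1
    return out

a_successive = one_hot(successive)  # 3 columns: the last row starts a new run
a_identical = one_hot(identical)    # 2 columns: the last row rejoins group 0

print(a_successive)
print(a_identical)
```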

Upvotes: 2
