arkadiy

Reputation: 766

How to fix an error with np.where function?

I am attempting to recode column values in pandas using a combination of `np.where` and `sample`. The desired result is to select 200 random rows from the rows labeled "Low_Valence" and 200 random rows from the rows labeled "High_Valence" in the "valence_median_split" column, and mark them in a new column. However, this does not seem to be working.

Here is the df:

df.head()

Out[34]: 
              ID Category  Num Vert_Horizon Description  Fem_Valence_Mean  \
0  Animals_001_h  Animals    1            h  Dead Stork              2.40   
1  Animals_002_v  Animals    2            v        Lion              6.31   
2  Animals_003_h  Animals    3            h       Snake              5.14   
3  Animals_004_v  Animals    4            v        Wolf              4.55   
4  Animals_005_h  Animals    5            h         Bat              5.29   

   Fem_Valence_SD  Fem_Av_Ap_Mean  Fem_Av/Ap_SD  Arousal_Mean  \
0            1.30            3.03          1.47          6.72   
1            2.19            5.96          2.24          6.69   
2            1.19            5.14          1.75          5.34   
3            1.87            4.82          2.27          6.84   
4            1.56            4.61          1.81          5.50   

          Luminance  Contrast  JPEG_size80   LABL   LABA  \
0          ...              126.05     68.45       263028  51.75  -0.39   
1          ...              123.41     32.34       250208  52.39  10.63   
2          ...              135.28     59.92       190887  55.45   0.25   
3          ...              122.15     75.10       282350  49.84   3.82   
4          ...              131.81     59.77       329325  54.26  -0.34   

    LABB  Entropy  Classification  temp_selection  valence_median_split  
0  16.93     7.86                            High           Low_Valence  
1  30.30     6.71                             NaN          High_Valence  
2   4.41     7.83                            High           Low_Valence  
3   1.36     7.69                            High           Low_Valence  
4  -0.95     7.82                            High           Low_Valence  

[5 rows x 35 columns]

Here is what I tried:

df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=200).reindex(df.index), 'High')
df.temp_selection.unique()

However, the results indicate that this did not work:

array(['High', nan, 'High_Valence'], dtype=object)

I am wondering if there is an error with combining these functions.

Here is a reproducible example:

d = {'col1': [1, 2, 3, 4, 3, 3, 2, 2], 'col2': [1, 2, 3, 4, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
df['valence_median_split'] = ''
#Get median of valence
valence_median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < valence_median, 'Low_Valence', 'High_Valence')
df['temp_selection'] = ''
df['temp_selection'] = np.where(df['valence_median_split'] == 'Low_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'Low')
df['temp_selection'] = np.where(df['valence_median_split'] == 'High_Valence', df['valence_median_split'].sample(n=2).reindex(df.index), 'High')
df
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence           High
1     2     2          Low_Valence           High
2     3     3         High_Valence   High_Valence
3     4     4         High_Valence            NaN
4     3     3         High_Valence            NaN
5     3     3         High_Valence   High_Valence
6     2     2          Low_Valence           High
7     2     2          Low_Valence           High

As can be seen in the df above, there is a 'High_Valence' classification within 'temp_selection' that should not be there, and no 'Low' classifications.

Upvotes: 1

Views: 173

Answers (1)

jezrael

Reputation: 862431

The idea is to get the indices of a sample of the filtered data and, instead of two separate np.where calls, use numpy.select:

low = df.loc[df['valence_median_split'] == 'Low_Valence', 
                'valence_median_split'].sample(n=2).index
high = df.loc[df['valence_median_split'] == 'High_Valence',
                 'valence_median_split'].sample(n=2).index
df['temp_selection'] = np.select([df.index.isin(low), df.index.isin(high)],
                                 ['Low', 'High'], default=np.nan)

Or:

df['temp_selection'] = np.where(df.index.isin(low), 'Low', 
                       np.where(df.index.isin(high), 'High', np.nan))

print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            nan
1     2     2          Low_Valence            Low
2     3     3         High_Valence            nan
3     4     4         High_Valence            nan
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            nan
7     2     2          Low_Valence            Low

Or:

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
print (df)
   col1  col2 valence_median_split temp_selection
0     1     1          Low_Valence            NaN
1     2     2          Low_Valence            Low
2     3     3         High_Valence            NaN
3     4     4         High_Valence            NaN
4     3     3         High_Valence           High
5     3     3         High_Valence           High
6     2     2          Low_Valence            NaN
7     2     2          Low_Valence            Low
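
As for why the original attempt misbehaves, the sketch below reproduces it step by step on the example frame. It is only an illustration: the seed passed to `random_state` and the intermediate names (`sampled`, `first`, `second`) are assumptions added here, not part of the question. Two things go wrong: `sample(...).reindex(df.index)` draws from the whole column and leaves NaN for every unsampled row, and the second `np.where` builds its result from scratch, so it replaces the first assignment instead of adding to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 3, 3, 2, 2],
                   'col2': [1, 2, 3, 4, 3, 3, 2, 2]})
median = df['col1'].median()
df['valence_median_split'] = np.where(df['col2'] < median,
                                      'Low_Valence', 'High_Valence')

# Sampling the whole column and reindexing leaves NaN for unsampled rows.
# random_state=0 is an arbitrary seed, used only to make the run repeatable.
sampled = (df['valence_median_split']
           .sample(n=2, random_state=0)
           .reindex(df.index))
n_missing = sampled.isna().sum()  # 8 rows minus 2 sampled rows

is_low = df['valence_median_split'] == 'Low_Valence'
is_high = df['valence_median_split'] == 'High_Valence'

# First call: 'Low' is the value used where the condition is False,
# so it lands on the High_Valence rows, not the Low_Valence ones.
first = np.where(is_low, sampled, 'Low')

# Second call: computed independently of `first`, so assigning it to the
# same column discards every 'Low' written by the first call.
second = np.where(is_high, sampled, 'High')

print(n_missing)
print(first)
print(second)
```

Sampling from the already filtered rows and then labelling by index, as in the snippets above, avoids both problems.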

Another idea is to use numpy.random.choice, passing replace=False so the same row cannot be drawn twice:

low = np.random.choice(df.index[df['valence_median_split'] == 'Low_Valence'], size=2, replace=False)
high = np.random.choice(df.index[df['valence_median_split'] == 'High_Valence'], size=2, replace=False)

df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'
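
A self-contained version of this last approach might look like the sketch below. It assumes NumPy's `Generator` API (`np.random.default_rng`) in place of the legacy `np.random.choice`, an arbitrary seed, and a small stand-in frame; none of these come from the question. Since `choice` samples with replacement by default, `replace=False` is what guarantees that each group contributes exactly two distinct rows.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same labels as the example above.
df = pd.DataFrame({'valence_median_split': ['Low_Valence', 'Low_Valence',
                                            'High_Valence', 'High_Valence',
                                            'High_Valence', 'High_Valence',
                                            'Low_Valence', 'Low_Valence']})

rng = np.random.default_rng(0)  # arbitrary seed, for repeatable draws

# Index labels of each group.
low_idx = df.index[df['valence_median_split'] == 'Low_Valence']
high_idx = df.index[df['valence_median_split'] == 'High_Valence']

# replace=False: draw without replacement, so no label is picked twice.
low = rng.choice(low_idx, size=2, replace=False)
high = rng.choice(high_idx, size=2, replace=False)

# Start from an all-missing object column, then label the sampled rows.
df['temp_selection'] = None
df.loc[low, 'temp_selection'] = 'Low'
df.loc[high, 'temp_selection'] = 'High'

print(df['temp_selection'].value_counts(dropna=False))
```

For the real data, `size=2` would become `size=200`, which requires each group to have at least 200 rows when sampling without replacement.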

Upvotes: 1
