renakre

Reputation: 8291

Replacing values in large number of columns with another column value based on a condition

I have this data:

id   |  d1   |  d2  |  d3  | .... |  d64   | FINAL_GRADE
1    |  0    |  15  |  0   | .... |  23    | 95
2    |  8    |  0   |  12  | .... |  0     | 75   

And I want to replace all non-zero values in each row with the corresponding value in the FINAL_GRADE column, and obtain this table:

id   |  d1   |  d2  |  d3  | .... |  d64   | FINAL_GRADE
1    |  0    |  95  |  0   | .... |  95    | 95
2    |  75   |  0   |  75  | .... |  0     | 75   

Here is my code:

df[df.ix[:, 1:63] != 0] = df['FINAL_GRADE']

But, I am receiving this error: TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value

Is there an issue with my code, or is my approach wrong altogether? I appreciate any help!
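For reference, a small frame that stands in for the data above (three d-columns instead of the 64) can be built like this, with the failing assignment shown commented out:

```python
import pandas as pd

# Hypothetical three d-columns standing in for d1..d64
df = pd.DataFrame(
    {'id': [1, 2],
     'd1': [0, 8], 'd2': [15, 0], 'd3': [0, 12],
     'FINAL_GRADE': [95, 75]},
    columns=['id', 'd1', 'd2', 'd3', 'FINAL_GRADE'])

# The failing line, adapted to this width (.ix is deprecated; .iloc is
# the positional equivalent):
# df[df.iloc[:, 1:-1] != 0] = df['FINAL_GRADE']
```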

Upvotes: 2

Views: 118

Answers (2)

Nickil Maveli

Reputation: 29711

One possibility would be to use the DF.mask() method with a boolean mask.

Setting values through .ix fails here presumably because you're operating on a subset of columns with mixed dtypes: the boolean mask and subsetting operation produce floats, while the FINAL_GRADE values being looked up are ints. This is the likely cause of the TypeError.

Steps:

1) Subset the dataframe by selecting the columns that start with the character d, using str.startswith.

2) Use DF.mask on this subset wherever the values are non-zero, replacing them with the corresponding FINAL_GRADE value row-wise by specifying axis=0.

3) Finally, concatenate id, the masked DF and FINAL_GRADE column-wise using pd.concat(axis=1).


sub_df = df[df.columns[df.columns.str.startswith('d')]]        # only the d* columns
mask_df = sub_df.mask(sub_df != 0, df['FINAL_GRADE'], axis=0)  # replace non-zeros row-wise
pd.concat([df['id'], mask_df, df['FINAL_GRADE']], axis=1)
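Run against the sample data from the question (shortened to three d-columns), these steps produce the expected table:

```python
import pandas as pd

# Sample data from the question, with three d-columns standing in for 64
df = pd.DataFrame(
    {'id': [1, 2],
     'd1': [0, 8], 'd2': [15, 0], 'd3': [0, 12],
     'FINAL_GRADE': [95, 75]},
    columns=['id', 'd1', 'd2', 'd3', 'FINAL_GRADE'])

sub_df = df[df.columns[df.columns.str.startswith('d')]]
mask_df = sub_df.mask(sub_df != 0, df['FINAL_GRADE'], axis=0)
result = pd.concat([df['id'], mask_df, df['FINAL_GRADE']], axis=1)

#    id  d1  d2  d3  FINAL_GRADE
# 0   1   0  95   0           95
# 1   2  75   0  75           75
```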


Upvotes: 3

instant

Reputation: 696

The following might be slightly more crude than strictly necessary, but I think it is a clean and generalized fit for your problem:

for i, row in df.iterrows():
    d = row[1:-1]                                 # everything between id and FINAL_GRADE
    df.loc[i, d.index[d != 0]] = row.FINAL_GRADE  # write back through df.loc

Note that I'm doing a couple of things here, so some notes:

  1. row[1:-1] addresses all row items except the first (id) and the last (FINAL_GRADE), so .ix is not necessary when using integer indexes for this case, and you are not locked into the case where you have exactly 64 columns.
  2. I am looping over all rows, which is generally not considered the most efficient way of doing things, but I find it readable and sufficiently ok for cases such as yours that are not high-performance calculations repeated hundreds of times.
  3. The row index that iterrows() gives me is needed to write back through df.loc: the row Series is only a copy, so assigning into it directly would not change df.
  4. Try to use .loc more than .ix because it leverages the semantic benefit that labelling your data gives you.

I'll try to think of a solution without a for loop that is considered pythonic and not too contrived or unreadable.

EDIT: Found a short version that is in my opinion both readable and simple/general enough to be applied to other/similar problems (using label-based selection rather than the deprecated .ix, and excluding the id column):

d_cols = df.columns[1:-1]
df[d_cols] = df[d_cols].where(df[d_cols] == 0, df.FINAL_GRADE, axis=0)
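Applied to the sample from the question (three d-columns standing in for the 64), a where-based write-back can be sketched as follows; slicing with df.columns[1:-1] assumes id is the first column and FINAL_GRADE the last:

```python
import pandas as pd

# Same hypothetical sample as in the question, with three d-columns
df = pd.DataFrame(
    {'id': [1, 2],
     'd1': [0, 8], 'd2': [15, 0], 'd3': [0, 12],
     'FINAL_GRADE': [95, 75]},
    columns=['id', 'd1', 'd2', 'd3', 'FINAL_GRADE'])

d_cols = df.columns[1:-1]  # every column between id and FINAL_GRADE
# Keep zeros as-is; replace everything else with FINAL_GRADE, row-wise
df[d_cols] = df[d_cols].where(df[d_cols] == 0, df.FINAL_GRADE, axis=0)
```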

Upvotes: 1
