Reputation: 1
I have a DataFrame in which the last value of some columns should be in a different column. This happened because the column names were not standardized, i.e., some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that simply finds the columns with >= 225 NaNs and moves the remaining value to the column it should be assigned to.
I've written a function that, for some reason, sometimes works and sometimes doesn't. When it does, it also creates approx. 850 columns in its wake (the original DataFrame has around 420, including the duplicate columns). I'm hoping for something that just reassigns the value. If it automatically deletes the incorrect column, that's awesome too, but I just used .dropna(thresh = 2) when my function originally worked.
Here's what it looks like originally:
in:
    df = pd.DataFrame(data = {'W10_IND_JAC_PRED': ['NaN','NaN','NaN','NaN','NaN',2],
                              'W10_JAC_IND_PRED': [1,2,1,2,1,'NAN']})
out:
    df
       W10_IND_JAC_PRED  W10_JAC_IND_PRED
    0               NaN                 1
    1               NaN                 2
    2               NaN                 1
    3               NaN                 2
    4               NaN                 1
    W                 2               NAN
I wrote this, which occasionally works but most of the time doesn't, and I'm not sure why.
    def switch_cols(x):
        """Takes mismatched columns (where only the last value != NaN) and changes order of team column names"""
        if x.isna().sum() == 5:
            col_string = x.name.split('_')
            col_to_switch = ('_').join([col_string[0],col_string[2],col_string[1],'PRED'])
            df[col_to_switch]['row_name'] = x[-1]
        else:
            pass
        return x
Most of the time it just returns the exact same DF. This is the desired outcome:
       W10_IND_JAC_PRED  W10_JAC_IND_PRED
    0               NaN                 1
    1               NaN                 2
    2               NaN                 1
    3               NaN                 2
    4               NaN                 1
    W                 2                 2
Does anyone have any tips, or can anyone explain why my function works maybe 10% of the time?
Edit:
This is an ugly "for" loop I wrote that works. I know there has to be a much more Pythonic way of doing this while preserving the original column names, though.
    for i in range(df.shape[1]):
        if df.iloc[:,i].isna().sum() == 5:
            split_nan_col = df.columns[i].split('_')
            correct_col_name = ('_').join([split_nan_col[0],split_nan_col[2],split_nan_col[1],split_nan_col[3]])
            df.loc[5,correct_col_name] = df.loc[5,df.columns[i]]
        else:
            pass
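For comparison, here is a minimal sketch of the same loop written over column names instead of positions. It assumes the missing entries are real NaN values (np.nan) rather than the strings 'NaN'/'NAN', and that every name has exactly four '_'-separated parts; swapped_name is a hypothetical helper, not part of pandas. One plausible reason switch_cols is unreliable is the chained assignment df[col]['row'] = value, which pandas does not guarantee will write to the original frame; a single .loc call avoids that.

```python
import numpy as np
import pandas as pd

# Example frame from the question, using real missing values (np.nan).
# This is an assumption about how the real data is stored.
df = pd.DataFrame({
    'W10_IND_JAC_PRED': [np.nan] * 5 + [2],
    'W10_JAC_IND_PRED': [1, 2, 1, 2, 1, np.nan],
})


def swapped_name(col):
    """Hypothetical helper: swap the two team codes in 'Wx_A_B_PRED'."""
    week, team_a, team_b, suffix = col.split('_')
    return '_'.join([week, team_b, team_a, suffix])


last_row = df.index[-1]
nan_threshold = len(df) - 1  # every row except the last one is NaN

# Columns whose only non-missing value sits in the last row.
misnamed = [c for c in df.columns if df[c].isna().sum() == nan_threshold]

for col in misnamed:
    target = swapped_name(col)
    if target in df.columns:
        # One .loc call writes into df itself (no chained indexing).
        df.loc[last_row, target] = df.loc[last_row, col]

# Drop the misnamed columns once their value has been moved.
df = df.drop(columns=misnamed)
```

Collecting the misnamed columns first and dropping them at the end keeps the original names of the surviving columns and avoids changing the frame's columns while looping over them.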
Upvotes: 0
Views: 81
Reputation: 323306
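Split each column name on '_', convert the parts to a frozenset so that the order of the parts no longer matters, then join them back into a single string. Columns that differ only in team order end up with the same name and can be grouped together. Note that this solution extends to any number of columns.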
    df.columns=df.columns.str.split('_').map(frozenset).map('_'.join)
    # groupby first will return the first non-null value
    df.mask(df=='NaN').groupby(level=0,axis=1).first()
       PRED_JAC_W10_IND
    0                 1
    1                 2
    2                 1
    3                 2
    4                 1
    5                 2
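A variant of the same idea, as a rough sketch under a few assumptions: the iteration order of a frozenset is not guaranteed, and recent pandas versions deprecate groupby with axis=1, so this version builds the key by sorting only the two team codes and groups on the transposed frame instead. The function canonical is a hypothetical helper, and the names are assumed to have exactly four '_'-separated parts.

```python
import pandas as pd

# Frame from the question, with the placeholder strings left as posted.
df = pd.DataFrame({
    'W10_IND_JAC_PRED': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 2],
    'W10_JAC_IND_PRED': [1, 2, 1, 2, 1, 'NAN'],
})

# Turn both placeholder spellings into real missing values.
clean = df.mask(df.isin(['NaN', 'NAN']))


def canonical(col):
    """Hypothetical helper: same key for 'Wx_A_B_PRED' and 'Wx_B_A_PRED'."""
    week, team_a, team_b, suffix = col.split('_')
    return '_'.join([week, *sorted([team_a, team_b]), suffix])


# Transpose so column names become the index, group the rows by the
# canonical key, keep the first non-null value per group, transpose back.
merged = clean.T.groupby(canonical).first().T
```

Here merged has a single column named W10_IND_JAC_PRED, because the team codes are sorted alphabetically. If a specific original name must be kept (W10_JAC_IND_PRED in the question), the columns would need to be renamed afterwards.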
Upvotes: 1