Reputation: 21
I have a dataframe in pandas, and I am trying to take data from the same row and different columns and fill NaN values in my data. How would I do this in pandas?
For example,
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
83 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0 NaN 28.0 18.0
The goal is for the data to look like this:
1 2 3 4 5 6 7 ... 10 11 12 13 14 15 16
83 NaN NaN NaN 27.0 29.0 29.0 30.0 ... 15.0 16.0 17.0 28.0 30.0 28.0 18.0
The goal is to be able to take the mean of the last five columns that have data. If there are not >= 5 data-filled cells, then take the average of however many cells there are.
Upvotes: 1
Views: 78
Reputation: 862481
Use function justify
for improve performance with filter all columns without first by DataFrame.iloc
:
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob 27.0 29.0 NaN 29.0 30.0 NaN NaN 15.0 16.0 17.0 NaN 28.0 30.0
14 15 16
80 NaN 28.0 18.0
df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
print (df)
name 1 2 3 4 5 6 7 8 9 10 11 12 13 \
80 bob NaN NaN NaN NaN NaN 27.0 29.0 29.0 30.0 15.0 16.0 17.0 28.0
14 15 16
80 30.0 28.0 18.0
Function:
#https://stackoverflow.com/a/44559180/2901002
def justify(a, invalid_val=0, axis=1, side='left'):
"""
Justifies a 2D array
Parameters
----------
A : ndarray
Input array to be justified
axis : int
Axis along which justification is to be made
side : str
Direction of justification. It could be 'left', 'right', 'up', 'down'
It should be 'left' or 'right' for axis=1 and 'up' or 'down' for axis=0.
"""
if invalid_val is np.nan:
mask = ~np.isnan(a)
else:
mask = a!=invalid_val
justified_mask = np.sort(mask,axis=axis)
if (side=='up') | (side=='left'):
justified_mask = np.flip(justified_mask,axis=axis)
out = np.full(a.shape, invalid_val)
if axis==1:
out[justified_mask] = a[mask]
else:
out.T[justified_mask.T] = a.T[mask.T]
return out
Performance:
#100 rows
df = pd.concat([df] * 100, ignore_index=True)
#41 times slowier
In [39]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
145 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [41]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
3.54 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#1000 rows
df = pd.concat([df] * 1000, ignore_index=True)
#198 times slowier
In [43]: %timeit df.loc[:,df.columns[1:]] = df.loc[:,df.columns[1:]].apply(fun, axis=1)
1.13 s ± 37.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [45]: %timeit df.iloc[:, 1:] = justify(df.iloc[:, 1:].to_numpy(), invalid_val=np.nan, side='right')
5.7 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 1
Reputation: 4021
Assuming you need to move all NaN
to the first columns I would define a function that takes all NaN
and places them first and leave the rest as it is:
def fun(row):
index_order = row.index[row.isnull()].append(row.index[~row.isnull()])
row.iloc[:] = row[index_order].values
return row
df_fix = df.loc[:,df.columns[1:]].apply(fun, axis=1)
If you need to overwrite the results in the same dataframe then:
df.loc[:,df.columns[1:]] = df_fix.copy()
Upvotes: 1