Reputation: 725
I have the following Dataframe:
import pandas as pd
data = {'MA1': [float("nan"), float("nan"), -1, 1],
        'MA2': [float("nan"), -1, 0, 0],
        'MA3': [0, 0, 1, -1]}
df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])
My goal: for every column, if the first non-NaN, non-zero value is -1, set it to 0.
Clarification:
The value should only be set to 0 if the first non-zero, non-NaN value is -1. If that first value is 1 or anything else, leave it as it is.
What is the fastest way to do it?
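For reference, applying that rule to the sample above should give the following (MA1 and MA2 have their leading -1 set to 0; MA3 is left alone because its first non-zero, non-NaN value is 1):
   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0   -1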
Upvotes: 0
Views: 537
Reputation: 725
I used a modification of @Erfan's answer.
As I explain in my clarification above, I only want to set the value to zero if the first non-zero, non-NaN value is -1. If it is anything else, don't do anything for that column.
import numpy as np

df_min = df_input.replace(0, np.NaN).idxmin()  # row label of each column's minimum, zeros treated as missing
df_max = df_input.replace(0, np.NaN).idxmax()  # row label of each column's maximum, zeros treated as missing

for col, idx in df_min.items():
    # only zero out the -1 when it appears before the column's maximum,
    # i.e. when the -1 is the first non-zero, non-NaN value in that column
    if df_input.loc[idx, col] == -1 and idx < df_max[col]:
        df_input.loc[idx, col] = 0
Upvotes: 0
Reputation: 16683
After about a year of using Python, I'm trying to get better at writing higher-performing solutions, so I thought I would test the performance of my answer against the others' (realizing that mine would be the slowest -- on the DataFrame I created, it ended up being 50,000x slower than the best answer! Woah!). Also, here is a good article about pandas and performance: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
My traditional slow looping method loops down 3 columns almost 100,000 times (the length of the DataFrame), while the best answer passes over the 3 columns only once: idxmin() identifies the relevant row directly, so there is no need to loop through all the rows.
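To make that concrete, here is a small illustration (not from the original answers) of what a single idxmin() call returns on the question's sample frame once zeros are replaced with NaN -- one row label per column, found in one vectorized pass:
import numpy as np
import pandas as pd

df_sample = pd.DataFrame({'MA1': [np.nan, np.nan, -1, 1],
                          'MA2': [np.nan, -1, 0, 0],
                          'MA3': [0, 0, 1, -1]})

# row label of each column's minimum, with zeros treated as missing
print(df_sample.replace(0, np.nan).idxmin())
# MA1    2
# MA2    1
# MA3    3
# dtype: int64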
Here is a DataFrame with 100,000 rows and 4 columns that I used to test against @Erfan's and @DerekO's answers:
import numpy as np
import pandas as pd

df_input = pd.DataFrame(np.random.randint(0, 10, size=(100000, 4)).astype(float),
                        columns=list('ABCD'))
df_input.iloc[99998:, 0:4] = -1  # put the -1s in the last two rows, so the row loop has to scan almost everything
My answer (slowest): 2.78 s ± 269 ms per loop
for col in df_input.columns:
    for row in range(len(df_input.index)):
        # walk down the column and zero out the first -1 we find
        if df_input.loc[row, col] == -1:
            df_input.loc[row, col] = 0
            break

df_input
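For anyone wanting to reproduce figures in this format: the "per loop" numbers look like %timeit output, so a minimal sketch (my assumption -- the original post does not show the timing harness) is to wrap each snippet in a function and time it on a fresh copy of df_input:
# Hypothetical wrapper around the double loop above, so %timeit can re-run it
# on an untouched copy of the benchmark frame each time.
def zero_first_minus_one_slow(df):
    for col in df.columns:
        for row in range(len(df.index)):
            if df.loc[row, col] == -1:
                df.loc[row, col] = 0
                break
    return df

# In an IPython/Jupyter cell:
# %timeit zero_first_minus_one_slow(df_input.copy())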
Derek O's answer #1: 283 ms ± 13.2 ms per loop
10x faster than my answer!
Erfan's answer #1: 2.73 ms ± 135 µs per loop
1,000x faster than my answer!
Erfan's answer #2: 54.8 µs ± 5.65 µs per loop
50,000x faster than my answer!
Upvotes: 1
Reputation: 42886
You can loop over the columns and use DataFrame.loc to assign 0 when the first valid (non-zero, non-NaN) value is -1:
import numpy as np

dft = df_input.replace(0, np.NaN)  # treat zeros as missing so idxmin skips them

for col in df_input.columns:
    idxmin = dft[col].idxmin()  # row label of this column's minimum, NaNs skipped
    if df_input.loc[idxmin, col] == -1:
        df_input.loc[idxmin, col] = 0
   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0
Or, more efficiently, use DataFrame.idxmin so we don't have to call Series.idxmin on each iteration of the loop:
dft = df_input.replace(0, np.NaN).idxmin()  # one DataFrame.idxmin call: row label of each column's minimum

for col, idx in dft.items():
    if df_input.loc[idx, col] == -1:
        df_input.loc[idx, col] = 0
   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0
Upvotes: 3
Reputation: 19545
Apply a custom function to each column. The function loops through the column's values to find the first non-NaN, non-zero value, sets it to 0 if it is -1, and then returns the (possibly modified) column.
import numpy as np
import pandas as pd

def set_column(col_values):
    for index, value in enumerate(col_values):
        # skip zeros and NaNs until the first "real" value
        if value != 0 and not np.isnan(value):
            if value == -1:
                col_values[index] = 0
            return col_values
    # no non-zero, non-NaN value in this column: return it unchanged
    return col_values

data = {'MA1': [float("nan"), float("nan"), -1, 1],
        'MA2': [float("nan"), -1, 0, 0],
        'MA3': [0, 0, 1, 0]}
df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])

df_output = df_input.copy().apply(set_column, axis=0)
Output:
>>> df_output
   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0
Upvotes: 0