Thanh Nguyen
Thanh Nguyen

Reputation: 912

Pandas set value in a column equal to 5% quantile if they are smaller than that

Generating data

random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                 index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
df[mask] = np.nan

I want to do the following operation: calculate the 5% quantile of each column, then compare the value of each cell in that column with the calculated quantile: if they are smaller, set them to the 5% quantile of the column.

I have read those questions

Pandas DataFrame: replace all values in a column, based on condition

Replacing values greater than a number in pandas dataframe

and come up with my solution:

df[df < df.quantile(q=0.05, axis=0)] = df.quantile(q=0.05, axis=0)

but it's not working, because I'm trying to replace each value with a series. How can I solve this problem? Thank you

Upvotes: 2

Views: 932

Answers (1)

jezrael
jezrael

Reputation: 862691

You can get quantile by DataFrame.quantile of all columns and pass it to DataFrame.clip.

np.random.seed(42)
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(np.random.randint(0,10,size=(len(date_rng), 3)),
                  columns=['data1', 'data2', 'data3'],
                 index= date_rng)
mask = np.random.choice([1, 0], df.shape, p=[.35, .65]).astype(bool)
print (df)
                     data1  data2  data3
2018-01-01 00:00:00      6      3      7
2018-01-01 01:00:00      4      6      9
2018-01-01 02:00:00      2      6      7
2018-01-01 03:00:00      4      3      7
2018-01-01 04:00:00      7      2      5
                   ...    ...    ...
2018-01-07 20:00:00      7      6      4
2018-01-07 21:00:00      0      6      6
2018-01-07 22:00:00      8      2      8
2018-01-07 23:00:00      0      0      3
2018-01-08 00:00:00      8      5      2

For testing is used different quantile:

print (df.quantile(q=0.55))
data1    6.0
data2    4.0
data3    5.0
Name: 0.55, dtype: float64

df = df.clip(lower=df.quantile(q=0.55), axis=1)
print (df)
                     data1  data2  data3
2018-01-01 00:00:00      6      4      7
2018-01-01 01:00:00      6      6      9
2018-01-01 02:00:00      6      6      7
2018-01-01 03:00:00      6      4      7
2018-01-01 04:00:00      7      4      5
                   ...    ...    ...
2018-01-07 20:00:00      7      6      5
2018-01-07 21:00:00      6      6      6
2018-01-07 22:00:00      8      4      8
2018-01-07 23:00:00      6      4      5
2018-01-08 00:00:00      8      5      5

Upvotes: 2

Related Questions