Reputation: 33
I like to retrieve data based on the column name and its minimum and maximum value. I am not able to figure out how to get that result. I am able to get data based on column name but don't understand how to apply the limit.
Column name and corresponding min and max value given in list and tuple.
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
selected_data = data_frame.loc[:, [X[0] for X in column_cutoff]]
return selected_data
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
newdata_cutoff = c_cutoff(df,column_cutoffdata)
print(df.head())
print(newdata_cutoff)
result
B E
R0 78 73
R1 27 7
R2 53 44
R3 65 84
R4 9 1
..
.
Expected output I want all value less than 27 and greater than 78 should be discarded, same for E
Upvotes: 1
Views: 370
Reputation: 164623
pipe
+ where
+ between
You can't discard values in an array; that would involve reshaping an array and a dataframe's columns must all have the same size.
But you can iterate and use pd.Series.where
to replace out-of-scope vales with NaN
. Note the Pandas way to feed a dataframe through a function is via pipe
:
import pandas as pd
import numpy as np
def c_cutoff(data_frame, column_cutoff):
for col, min_val, max_val in column_cutoffdata:
data_frame[col] = data_frame[col].where(data_frame[col].between(min_val, max_val))
return data_frame
np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
columns=list('ABCDEF'),
index=['R{}'.format(i) for i in range(100)])
column_cutoffdata = [('B',27,78),('E',44,73)]
print(df.head())
# A B C D E F
# R0 99 78 61 16 73 8
# R1 62 27 30 80 7 76
# R2 15 53 80 27 44 77
# R3 75 65 47 30 84 86
# R4 18 9 41 62 1 82
newdata_cutoff = df.pipe(c_cutoff, column_cutoffdata)
print(newdata_cutoff.head())
# A B C D E F
# R0 99 78.0 61 16 73.0 8
# R1 62 27.0 30 80 NaN 76
# R2 15 53.0 80 27 44.0 77
# R3 75 65.0 47 30 NaN 86
# R4 18 NaN 41 62 NaN 82
If you want to drop rows with any NaN
values, you can then use dropna
:
newdata_cutoff = newdata_cutoff.dropna()
Upvotes: 1
Reputation: 9019
You can be rather explicit and do the following:
lim = [('B',27,78),('E',44,73)]
for lim in limiters:
df = df[(df[lim[0]]>=lim[1]) & (df[lim[0]]<=lim[2])]
Yields:
A B C D E F
R0 99 78 61 16 73 8
R2 15 53 80 27 44 77
R8 30 62 11 67 65 55
R11 90 31 9 38 47 16
R15 16 64 8 90 44 37
R16 94 75 5 22 52 69
R46 11 30 26 8 51 61
R48 39 59 22 80 58 44
R66 55 38 5 49 58 15
R70 36 78 5 13 73 69
R72 70 58 52 99 67 11
R75 20 59 57 33 53 96
R77 32 31 89 49 69 41
R79 43 28 17 16 73 54
R80 45 34 90 67 69 70
R87 9 50 16 61 65 30
R90 43 56 76 7 47 62
Upvotes: 1