Bhaskar
Bhaskar

Reputation: 33

Panda Dataframe query

I like to retrieve data based on the column name and its minimum and maximum value. I am not able to figure out how to get that result. I am able to get data based on column name but don't understand how to apply the limit.

Column name and corresponding min and max value given in list and tuple.

import pandas as pd
import numpy as np

def c_cutoff(data_frame, column_cutoff):

    selected_data = data_frame.loc[:, [X[0] for X in column_cutoff]]

    return selected_data


np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])

column_cutoffdata = [('B',27,78),('E',44,73)]

newdata_cutoff = c_cutoff(df,column_cutoffdata)
print(df.head())


print(newdata_cutoff)

result

    B   E
R0   78  73
R1   27   7
R2   53  44
R3   65  84
R4    9   1
..
.

Expected output I want all value less than 27 and greater than 78 should be discarded, same for E

Upvotes: 1

Views: 370

Answers (2)

jpp
jpp

Reputation: 164623

pipe + where + between

You can't discard values in an array; that would involve reshaping an array and a dataframe's columns must all have the same size.

But you can iterate and use pd.Series.where to replace out-of-scope vales with NaN. Note the Pandas way to feed a dataframe through a function is via pipe:

import pandas as pd
import numpy as np

def c_cutoff(data_frame, column_cutoff):
    for col, min_val, max_val in column_cutoffdata:
        data_frame[col] = data_frame[col].where(data_frame[col].between(min_val, max_val))
    return data_frame

np.random.seed(5)
df = pd.DataFrame(np.random.randint(100, size=(100, 6)),
                  columns=list('ABCDEF'),
                  index=['R{}'.format(i) for i in range(100)])

column_cutoffdata = [('B',27,78),('E',44,73)]

print(df.head())

#      A   B   C   D   E   F
# R0  99  78  61  16  73   8
# R1  62  27  30  80   7  76
# R2  15  53  80  27  44  77
# R3  75  65  47  30  84  86
# R4  18   9  41  62   1  82

newdata_cutoff = df.pipe(c_cutoff, column_cutoffdata)

print(newdata_cutoff.head())

#      A     B   C   D     E   F
# R0  99  78.0  61  16  73.0   8
# R1  62  27.0  30  80   NaN  76
# R2  15  53.0  80  27  44.0  77
# R3  75  65.0  47  30   NaN  86
# R4  18   NaN  41  62   NaN  82

If you want to drop rows with any NaN values, you can then use dropna:

newdata_cutoff = newdata_cutoff.dropna()

Upvotes: 1

rahlf23
rahlf23

Reputation: 9019

You can be rather explicit and do the following:

lim = [('B',27,78),('E',44,73)]

for lim in limiters:
    df = df[(df[lim[0]]>=lim[1]) & (df[lim[0]]<=lim[2])]

Yields:

      A   B   C   D   E   F
R0   99  78  61  16  73   8
R2   15  53  80  27  44  77
R8   30  62  11  67  65  55
R11  90  31   9  38  47  16
R15  16  64   8  90  44  37
R16  94  75   5  22  52  69
R46  11  30  26   8  51  61
R48  39  59  22  80  58  44
R66  55  38   5  49  58  15
R70  36  78   5  13  73  69
R72  70  58  52  99  67  11
R75  20  59  57  33  53  96
R77  32  31  89  49  69  41
R79  43  28  17  16  73  54
R80  45  34  90  67  69  70
R87   9  50  16  61  65  30
R90  43  56  76   7  47  62

Upvotes: 1

Related Questions