prem
prem

Reputation: 624

Drop rows based on a threshold value of a column

With elasticsearch search Indexed the data. The input file much match query is dob and last name. It has student with same dob. So that file also coming as output. Have an idea about to remove low score rows. How can I approach?

Filename Name      DOB         Score PageNumber
11086   Ram     11 06 1930  6.4504585   1
11086   Ram     11 06 1930  6.4504585   2
11086   Ram     11 06 1930  6.4504585   1
81564   Kiran   11 06 1930  3.5517883   2
81564   Kiran   11 06 1930  3.5517883   33
81564   Kiran   11 06 1930  3.5517883   12
754133  peter   11 06 1930  2.5905614   1
754133  peter   11 06 1930  2.5905614   1

Desired output

Filename Name      DOB         Score PageNumber
11086   Ram     11 06 1930  6.4504585   1
11086   Ram     11 06 1930  6.4504585   2
11086   Ram     11 06 1930  6.4504585   1

Upvotes: 1

Views: 3054

Answers (2)

cs95
cs95

Reputation: 403130

Let's try .std based filtering.

df = df[~((df.Score - df.Score.max()).abs() > df.Score.std())]
df

   Filename Name         DOB     Score  PageNumber
0     11086  Ram  11 06 1930  6.450458           1
1     11086  Ram  11 06 1930  6.450458           2
2     11086  Ram  11 06 1930  6.450458           1

Score.std becomes the dynamic threshold for your data.


Where,

((df.Score - df.Score.max()).abs())

0    0.000000
1    0.000000
2    0.000000
3    2.898670
4    2.898670
5    2.898670
6    3.859897
7    3.859897
Name: Score, dtype: float64

df.Score.std()
1.7451830491923459

df.Score.max()
6.4504584999999999

Upvotes: 3

piRSquared
piRSquared

Reputation: 294516

Supposing you want only rows with a score greater than 3

df.query('Score > 3')

   Filename   Name         DOB     Score  PageNumber
0     11086    Ram  11 06 1930  6.450458           1
1     11086    Ram  11 06 1930  6.450458           2
2     11086    Ram  11 06 1930  6.450458           1
3     81564  Kiran  11 06 1930  3.551788           2
4     81564  Kiran  11 06 1930  3.551788          33
5     81564  Kiran  11 06 1930  3.551788          12

Supposing you want to filter by some multiple of standard deviation

df[df.Score > (df.Score.mean() - 1 * df.Score.std())]

   Filename   Name         DOB     Score  PageNumber
0     11086    Ram  11 06 1930  6.450458           1
1     11086    Ram  11 06 1930  6.450458           2
2     11086    Ram  11 06 1930  6.450458           1
3     81564  Kiran  11 06 1930  3.551788           2
4     81564  Kiran  11 06 1930  3.551788          33
5     81564  Kiran  11 06 1930  3.551788          12

Or you can grab just the rows that equal the max

df.query('Score == @df.Score.max()')

   Filename   Name         DOB     Score  PageNumber
0     11086    Ram  11 06 1930  6.450458           1
1     11086    Ram  11 06 1930  6.450458           2
2     11086    Ram  11 06 1930  6.450458           1

Or

df[df.Score == df.Score.max()]

   Filename   Name         DOB     Score  PageNumber
0     11086    Ram  11 06 1930  6.450458           1
1     11086    Ram  11 06 1930  6.450458           2
2     11086    Ram  11 06 1930  6.450458           1

Upvotes: 4

Related Questions