littleal4

Reputation: 13

How to select dataframe column values in a specified range?

This is my code:

df = pd.read_csv("/content/Intel_AI4Y/My Drive/Intel_AI4Y_Colab/Module_16/data/Students_Score1.csv")

names = ["Student No." ,"Hours spent studying in a day", "Mathematics score", "English score","Science score"]

df.columns = names

Mathematics_score = df.iloc[:, 0]

df = df[~df.iloc[:, 0].between(100, 0, inclusive=False)]

print(df.describe())

print (df.info())

I'm trying to remove erroneous data from Mathematics score, i.e. values that are below 0 or above 100. I'm not sure how I'm supposed to go about coding this. Can anyone help?

Upvotes: 1

Views: 1081

Answers (2)

Trenton McKinney

Reputation: 62373

  • df = df[~df.iloc[:, 0].between(100, 0, inclusive=False)] is almost correct
  • pandas.Series.between requires a left and right boundary, which should be 0 and 100 respectively.
  • ~ is the logical NOT operator, so df.iloc[:, 0].between(0, 100, inclusive=False) selects everything strictly between 0 and 100, while ~df.iloc[:, 0].between(0, 100, inclusive=False) selects values <=0 or >=100.
  • To return values between 0 and 100, use df[df.iloc[:, 0].between(0, 100, inclusive=False)]
  • Also see Pandas: Indexing and selecting data
  • See Pandas: Selection by position for the proper use of .iloc. df.iloc[:, 0] selects all rows (:) and the column at index 0. My sample data only has one column, so its index is 0. You need to verify the index of your column of interest.
import pandas as pd
import numpy as np

# sample dataframe
np.random.seed(100)
df = pd.DataFrame({'values': [np.random.randint(-100, 200) for _ in range(500)]})

# values between 0 and 100
df[df.iloc[:, 0].between(0, 100, inclusive=False)]

 values
     43
     37
     55
     41
     35

# values <=0 or >=100
df[~df.iloc[:, 0].between(0, 100, inclusive=False)]

 values
    -92
    180
    -21
    -47
    -34
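
A note for newer pandas versions: as far as I know, boolean values for inclusive were deprecated in pandas 1.3 and removed in 2.0, where the accepted values are the strings "both", "neither", "left" and "right". The following is a minimal sketch under that assumption. The sample values are made up for illustration, and the column is selected by name instead of by position.

```python
import pandas as pd

# Hypothetical sample data; the column name mirrors the question.
df = pd.DataFrame({"Mathematics score": [-5, 0, 43, 100, 120]})
scores = df["Mathematics score"]

# Keep scores in the closed range [0, 100], so 0 and 100 count as valid.
valid = df[scores.between(0, 100, inclusive="both")]

# Keep scores strictly between 0 and 100 (the old inclusive=False behaviour).
strict = df[scores.between(0, 100, inclusive="neither")]

# Negate the mask with ~ to inspect the rows that fall outside [0, 100].
invalid = df[~scores.between(0, 100, inclusive="both")]

print(valid)
print(strict)
print(invalid)
```

Whether 0 and 100 should count as valid depends on the data. The question describes errors as values below 0 or above 100, which suggests inclusive="both" may be the closer match.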

Upvotes: 1

Christopher

Reputation: 731

Since your data frame comes with headers, I would suggest filtering with a boolean mask on the column name, as follows.

df = df[(df['Mathematics score'] > 0) & (df['Mathematics score'] < 100)]

As suggested by @Trenton McKinney, it is true that using iloc sometimes is easier because you don't have to type the column name.

In your case, Mathematics score is the third column (index 2), so you would use:

df[df.iloc[:, 2].between(0, 100, inclusive=False)]
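
To show how the name-based and position-based selections relate, here is a minimal sketch on made-up data (the values and the reduced set of columns are assumptions for illustration only). It looks up the column position with df.columns.get_loc instead of hard-coding it, and uses the default inclusive bounds of between so that 0 and 100 are kept.

```python
import pandas as pd

# Hypothetical data with the first three columns from the question.
df = pd.DataFrame({
    "Student No.": [1, 2, 3, 4],
    "Hours spent studying in a day": [2, 3, 1, 4],
    "Mathematics score": [-10, 55, 100, 130],
})

col = "Mathematics score"

# Look up the integer position of the column rather than counting by hand.
pos = df.columns.get_loc(col)

# Boolean mask by column name, with inclusive bounds.
by_name = df[(df[col] >= 0) & (df[col] <= 100)]

# Same filter by position; between() includes both endpoints by default.
by_pos = df[df.iloc[:, pos].between(0, 100)]

print(pos)
print(by_name)
print(by_name.equals(by_pos))
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if columns are reordered or added.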

Upvotes: 0
