Reputation: 13
This is my code:
df = pd.read_csv("/content/Intel_AI4Y/My Drive/Intel_AI4Y_Colab/Module_16/data/Students_Score1.csv")
names = ["Student No." ,"Hours spent studying in a day", "Mathematics score", "English score","Science score"]
df.columns = names
Mathematics_score = df.iloc[:, 0]
df = df[~df.iloc[:, 0].between(100, 0, inclusive=False)]
print(df.describe())
print (df.info())
I'm trying to remove erroneous data from the Mathematics score column, i.e. any value that is below 0 or above 100. I'm not sure how I'm supposed to go about coding this. Can anyone help?
Upvotes: 1
Views: 1081
Reputation: 62373
df = df[~df.iloc[:, 0].between(100, 0, inclusive=False)] is almost correct.
pandas.Series.between requires a left and a right boundary, which should be 0 and 100 respectively.
~ is the "not" operator, so in effect df.iloc[:, 0].between(0, 100, inclusive=False) returns everything between 0 and 100, while ~df.iloc[:, 0].between(0, 100, inclusive=False) returns values <=0 and >=100.
To keep only the values between 0 and 100, use df[df.iloc[:, 0].between(0, 100, inclusive=False)].
With .iloc, df.iloc[:, 0] means you have selected all rows (:) and the column at index 0. My sample data only has one column, so it is at index 0. You need to verify the index for your column of interest.
import pandas as pd
import numpy as np
# sample dataframe
np.random.seed(100)
df = pd.DataFrame({'values': [np.random.randint(-100, 200) for _ in range(500)]})
# values between 0 and 100
df[df.iloc[:, 0].between(0, 100, inclusive=False)]
values
43
37
55
41
35
# values <=0 or >=100
df[~df.iloc[:, 0].between(0, 100, inclusive=False)]
values
-92
180
-21
-47
-34
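For the question's stated goal (dropping scores below 0 or above 100 while keeping 0 and 100 themselves), a minimal sketch might look like the following. It assumes pandas 1.3 or later, where `inclusive` takes a string ("both", "neither", "left" or "right") instead of a boolean; the small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name matches the question's header.
scores = pd.DataFrame({"Mathematics score": [-5, 0, 42, 100, 101, 250]})

# Keep rows whose score lies in [0, 100]; inclusive="both" keeps 0 and 100.
in_range = scores["Mathematics score"].between(0, 100, inclusive="both")
valid = scores[in_range]

# The complement (~) selects the erroneous rows instead.
invalid = scores[~in_range]

print(valid["Mathematics score"].tolist())    # [0, 42, 100]
print(invalid["Mathematics score"].tolist())  # [-5, 101, 250]
```

Selecting by column name rather than position avoids having to count columns, at the cost of typing the header.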
Upvotes: 1
Reputation: 731
Since your data frame comes with headers, I would suggest using a boolean mask filter as follows.
df = df[(df['Mathematics score'] > 0) & (df['Mathematics score'] < 100)]
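Note that the strict comparisons above also drop scores of exactly 0 and 100. If those should count as valid, a sketch with inclusive comparisons could look like this. The sample values are invented, and the parentheses around each comparison are needed because `&` binds more tightly than `>=` and `<=` in Python.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df_demo = pd.DataFrame({
    "Student No.": [1, 2, 3, 4],
    "Mathematics score": [0, 100, -3, 120],
})

math = df_demo["Mathematics score"]

# True where the score is a plausible value in [0, 100].
mask = (math >= 0) & (math <= 100)

# Keep only the rows where the mask is True.
cleaned = df_demo[mask]
print(cleaned)
```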
As suggested by @Trenton McKinney, using iloc is sometimes easier because you don't have to type the column name.
In your case, the column Mathematics score is the third one (index 2), so to keep only the scores between 0 and 100 you should do:
df[df.iloc[:, 2].between(0, 100, inclusive=False)]
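Rather than counting columns by hand, the position can be looked up from the header. A small sketch, assuming the column names from the question and two made-up rows (`df_demo`, `pos` and `kept` are illustrative names):

```python
import pandas as pd

names = ["Student No.", "Hours spent studying in a day",
         "Mathematics score", "English score", "Science score"]

# Hypothetical rows; the second one has an out-of-range Mathematics score.
df_demo = pd.DataFrame(
    [[1, 2.0, 55, 60, 70],
     [2, 3.5, 130, 80, 65]],
    columns=names,
)

# Index.get_loc returns the integer position of a uniquely named column.
pos = df_demo.columns.get_loc("Mathematics score")
print(pos)  # 2

# Use that position with iloc; between() includes both bounds by default.
kept = df_demo[df_demo.iloc[:, pos].between(0, 100)]
print(kept)
```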
Upvotes: 0