Reputation: 27
I'm new to Python and I want to understand how I can remove values from my dataset that are 0.00000. For context, I am working with this dataset: https://www.kaggle.com/ksuchris2000/oklahoma-earthquakes-and-saltwater-injection-wells
The file InjectionWells.csv has some zero values in its coordinate columns (LAT and LONG) which I need to remove, but I don't know exactly how. This is so I can make a scatterplot with longitude on X and latitude on Y.
I tried the following but it didn't work. Can you please guide me?
Upvotes: 0
Views: 1333
Reputation: 33960
First run `dat.info()` to see which columns are numeric and what the dtypes are. You are interested in `LAT` and `LONG`.
Then run `dat[['LAT','LONG']].describe()` on your two columns of interest to get descriptive statistics and find out their outlier values. `.describe()` takes an argument `percentiles`, a list which defaults to `[.25, .5, .75]` and returns the 25th, 50th, and 75th percentiles.
>>> pd.options.display.float_format = '{:.2f}'.format  # suppress unwanted decimal places
>>> dat[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 11125.00 11125.00
mean 35.21 -96.85
std 2.69 7.58
min 0.00 -203.63
1% 33.97 -101.80 # <---- 1st percentile
5% 34.20 -99.76
10% 34.29 -98.25
25% 34.44 -97.63
50% 35.15 -97.37
90% 36.78 -95.95
95% 36.85 -95.74
99% 36.96 -95.48 # <---- 99th percentile
max 73.99 97.70
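A minimal, self-contained sketch of the same idea (a tiny invented DataFrame instead of the Kaggle file) shows how zero coordinates surface in the `describe()` extremes:

```python
import pandas as pd

# Tiny invented stand-in for the real dataset: one bad (0, 0) coordinate pair
dat = pd.DataFrame({
    'LAT':  [35.2, 34.4, 0.0, 36.1],
    'LONG': [-97.3, -97.6, 0.0, -96.0],
})

stats = dat[['LAT', 'LONG']].describe()

# A zero LAT shows up as an implausibly small minimum; a zero LONG
# (Oklahoma longitudes are around -97) as an implausibly large maximum
print(stats.loc[['min', 'max']])
```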
So the 1st-99th percentile ranges of your LAT and LONG values are:
33.97 <= LAT <= 36.96
-101.80 <= LONG <= -95.48
Now filter out the rows that fall outside those ranges, either with a row-wise `apply(..., axis=1)`:
>>> dat2 = dat[dat.apply(lambda row: (33.97 <= row['LAT'] <= 36.96) and (-101.80 <= row['LONG'] <= -95.48), axis=1)]
# OR equivalently, vectorized with .between():
>>> dat2 = dat[dat['LAT'].between(33.97, 36.96) & dat['LONG'].between(-101.80, -95.48)]
>>> dat2
API# Operator Operator ID WellType ... ZONE Unnamed: 18 Unnamed: 19 Unnamed: 20
0 3500300026.00 PHOENIX PETROCORP INC 19499.00 2R ... CHEROKEE NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
11121 3515323507.00 SANDRIDGE EXPLORATION & PRODUCTION LLC 22281.00 2D ... MUSSELLEM, OKLAHOMA NaN NaN NaN
[10760 rows x 21 columns]
Note this has gone from 11125 down to 10760 rows. So we dropped 365 rows.
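The same before-and-after row-count check can be sketched on toy data (coordinates invented for illustration; only the two cutoff ranges come from the percentiles above):

```python
import pandas as pd

# Toy stand-in for the real dataset: three in-range rows,
# one (0, 0) row, and one far-out row
dat = pd.DataFrame({
    'LAT':  [35.2, 34.4, 0.0, 36.1, 73.99],
    'LONG': [-97.3, -97.6, 0.0, -96.0, 97.70],
})

# Keep only rows whose coordinates fall inside the chosen percentile ranges
dat2 = dat[dat['LAT'].between(33.97, 36.96) & dat['LONG'].between(-101.80, -95.48)]

print(len(dat), '->', len(dat2))  # 5 -> 3
```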
Finally, it's always a good idea to check that the extreme values of your filtered `LAT`, `LONG` are in the range you expected:
>>> dat2[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 10760.00 10760.00
mean 35.33 -97.25
std 0.91 1.11
min 33.97 -101.76
1% 34.08 -101.62
5% 34.21 -99.19
10% 34.30 -98.20
25% 34.44 -97.62
50% 35.13 -97.36
90% 36.77 -95.99
95% 36.83 -95.80
99% 36.93 -95.56
max 36.96 -95.49
PS: there's nothing magical about taking the 1st/99th percentiles. You can play with the `percentiles` argument to `describe(...)` yourself; you could use 0.005, 0.002, 0.001, etc. You get to decide what constitutes an outlier.
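If you'd rather not hard-code the cutoffs at all, you can compute them with `DataFrame.quantile` (a sketch on invented data; 0.01/0.99 are just example choices):

```python
import pandas as pd

dat = pd.DataFrame({
    'LAT':  [35.2, 34.4, 0.0, 36.1, 35.0, 34.8],
    'LONG': [-97.3, -97.6, 0.0, -96.0, -97.0, -96.5],
})

# Per-column 1st and 99th percentiles, computed rather than typed in by hand
lo = dat[['LAT', 'LONG']].quantile(0.01)
hi = dat[['LAT', 'LONG']].quantile(0.99)

mask = (dat['LAT'].between(lo['LAT'], hi['LAT'])
        & dat['LONG'].between(lo['LONG'], hi['LONG']))
dat2 = dat[mask]

# The (0, 0) row falls outside both computed ranges and is dropped
print(len(dat2))
```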
Upvotes: 3
Reputation: 10545
You can create a Boolean series by comparing a column of a dataframe to a single value. Then you can use that series to index the dataframe, so that only those rows that meet the condition are selected:
data = df[['LONG', 'LAT']]
data = data[data['LONG'] < -75]
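Applied to the original question, the same Boolean-indexing idea can also drop the zero coordinates directly (a sketch with invented values; only the column names LAT/LONG come from the Kaggle file):

```python
import pandas as pd

df = pd.DataFrame({
    'LAT':  [35.2, 0.0, 34.4],
    'LONG': [-97.3, 0.0, -97.6],
})

# Boolean mask: keep rows where neither coordinate is exactly zero
data = df[(df['LAT'] != 0) & (df['LONG'] != 0)]

print(len(data))  # 2
```

`data.plot.scatter(x='LONG', y='LAT')` would then plot the cleaned points.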
Upvotes: 0