Ricardo Ortega

Reputation: 27

Data cleaning of (lat,long) coordinates

I'm new to Python and I want to understand how I can remove values from my dataset that are 0.00000. For context, I am working on this dataset: https://www.kaggle.com/ksuchris2000/oklahoma-earthquakes-and-saltwater-injection-wells

The file InjectionWells.csv has some bad values in its coordinate columns (LAT and LONG) which I need to remove, but I don't know exactly how. This is so I can make a scatterplot with longitude on X and latitude on Y.

I tried the following, but it didn't work. Can you please guide me?

Upvotes: 0

Views: 1333

Answers (2)

smci

Reputation: 33960
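
Throughout, dat is the DataFrame loaded from the question's CSV. A minimal setup sketch (the filename comes from the question; the path is an assumption):

>>> import pandas as pd
>>> dat = pd.read_csv('InjectionWells.csv')  # assumes the file is in the working directory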

  1. You need to discover the outlier values in LAT and LONG.
    • Your plot is one way, but here's an automated way.
  2. First, use dat.info() to see which columns are numeric and what their dtypes are. You are interested in LAT and LONG.
  3. Then use dat[['LAT','LONG']].describe() on your two columns of interest to get descriptive statistics and find their outlier values.
    • .describe() takes a percentiles argument (a list), which defaults to [.25, .5, .75], i.e. the 25th, 50th, and 75th percentiles.
    • ...but you want to exclude rare/outlier values, so also include (say) the 1st/99th and 5th/95th percentiles:
>>> pd.options.display.float_format = '{:.2f}'.format  # suppress unwanted dp's
>>> dat[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
# OR, once you know the cutoffs (they come from the output below), filter in one vectorized line:
>>> dat[dat['LAT'].between(33.97,36.96) & dat['LONG'].between(-101.80,-95.48)]

           LAT     LONG
count 11125.00 11125.00
mean     35.21   -96.85
std       2.69     7.58
min       0.00  -203.63
1%       33.97  -101.80  # <---- 1st percentile
5%       34.20   -99.76
10%      34.29   -98.25
25%      34.44   -97.63
50%      35.15   -97.37
90%      36.78   -95.95
95%      36.85   -95.74
99%      36.96   -95.48  # <---- 99th percentile
max      73.99    97.70

So the 1st-99th percentile ranges of your LAT and LONG values are:

  33.97 <= LAT  <=  36.96
-101.80 <= LONG <= -95.48
  4. So now you can keep only the in-range rows (i.e. drop the outliers) with a one-line apply(..., axis=1):
    dat2 = dat[ dat.apply(lambda row: (33.97 <= row['LAT'] <= 36.96) and (-101.80 <= row['LONG'] <= -95.48), axis=1) ]

                   API#                                Operator  Operator ID WellType  ...                                   ZONE Unnamed: 18  Unnamed: 19 Unnamed: 20
    0     3500300026.00                   PHOENIX PETROCORP INC     19499.00       2R  ...                               CHEROKEE         NaN          NaN         NaN
    ...             ...                                     ...          ...      ...  ...                                    ...         ...          ...         ...
    11121 3515323507.00  SANDRIDGE EXPLORATION & PRODUCTION LLC     22281.00       2D  ...                    MUSSELLEM, OKLAHOMA         NaN          NaN         NaN
    
    [10760 rows x 21 columns]

Note this has gone from 11125 rows down to 10760, so we dropped 365 outlier rows.
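
As an aside, the .between() filter shown earlier does the same thing in vectorized form, without the row-wise lambda, and is usually much faster:

>>> dat2 = dat[dat['LAT'].between(33.97, 36.96) & dat['LONG'].between(-101.80, -95.48)]

Series.between() is inclusive at both ends by default, so this matches the <= comparisons above.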

Finally, it's always a good idea to check that the extreme values of your filtered LAT and LONG are in the range you expected:

>>> dat2[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
           LAT     LONG
count 10760.00 10760.00
mean     35.33   -97.25
std       0.91     1.11
min      33.97  -101.76
1%       34.08  -101.62
5%       34.21   -99.19
10%      34.30   -98.20
25%      34.44   -97.62
50%      35.13   -97.36
90%      36.77   -95.99
95%      36.83   -95.80
99%      36.93   -95.56
max      36.96   -95.49
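
And since the end goal in the question was a scatterplot of longitude vs latitude, the cleaned frame plots directly (a sketch, assuming matplotlib is installed; the marker size is just a choice):

>>> import matplotlib.pyplot as plt
>>> dat2.plot.scatter(x='LONG', y='LAT', s=2)
>>> plt.show()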

PS: there's nothing magical about taking the 1st/99th percentiles. You can play with describe(..., percentiles=...) yourself; you could use 0.005/0.995, 0.002/0.998, 0.001/0.999, etc. You get to decide what constitutes an outlier.
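
If you'd rather compute those cutoffs programmatically than read them off the describe() output, Series.quantile() accepts the same kind of list (a sketch; the variable names are mine):

>>> lat_lo, lat_hi = dat['LAT'].quantile([0.01, 0.99])
>>> long_lo, long_hi = dat['LONG'].quantile([0.01, 0.99])
>>> dat2 = dat[dat['LAT'].between(lat_lo, lat_hi) & dat['LONG'].between(long_lo, long_hi)]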

Upvotes: 3

Arne

Reputation: 10545

You can create a Boolean series by comparing a column of a dataframe to a single value. Then you can use that series to index the dataframe, so that only those rows that meet the condition are selected:

data = df[['LONG', 'LAT']]           # keep just the coordinate columns
data = data[data['LONG'] < -75]      # Oklahoma longitudes are all well below -75, so this drops the bad 0.0 rows
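
If you also need to drop the zero latitudes from the question, conditions can be combined with & (a sketch; note that pandas requires parentheses around each comparison):

data = data[(data['LONG'] < -75) & (data['LAT'] > 0)]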

Upvotes: 0
