Finding outliers for different distributions

Question

I am currently learning how to preprocess data, and I am working with this DataFrame:

df = pd.read_csv(...)
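df.info()

|   | Country | Year | Status | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness  1-19 years | Alcohol | Schooling | Life expectancy |
|---|---------|------|--------|------------|-------------|---------|-------|------------|----------|---------------|-------------------|-------------------|-----|-----|----------------------|---------|-----------|-----------------|
| 0 | Afghanistan | 2015 | Developing | 33736494.0 | 65.0 | 1154 | 6.0 | 65.0 | 0.1 | 62 | 83 | 8.16 | 584.259210 | 19.1 | 17.2 | 0.01 | 10.1 | 65.0 |
| 1 | Afghanistan | 2014 | Developing | 327582.0 | 62.0 | 492 | 58.0 | 62.0 | 0.1 | 64 | 86 | 8.18 | 612.696514 | 18.6 | 17.5 | 0.01 | 10.0 | 59.9 |
| 2 | Afghanistan | 2013 | Developing | 31731688.0 | 64.0 | 430 | 62.0 | 64.0 | 0.1 | 66 | 89 | 8.13 | 631.744976 | 18.1 | 17.7 | 0.01 | 9.9 | 59.9 |
| 3 | Afghanistan | 2012 | Developing | 3696958.0 | 67.0 | 2787 | 67.0 | 67.0 | 0.1 | 69 | 93 | 8.52 | 669.959000 | 17.6 | 17.9 | 0.01 | 9.8 | 59.5 |
| 4 | Afghanistan | 2011 | Developing | 2978599.0 | 68.0 | 3013 | 68.0 | 68.0 | 0.1 | 71 | 97 | 7.87 | 63.537231 | 17.2 | 18.2 | 0.01 | 9.5 | 59.2 |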

Then I excluded the non-numeric columns:

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = df.select_dtypes(include=numerics).iloc[:,1:]
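newdf

|   | Population | Hepatitis B | Measles | Polio | Diphtheria | HIV/AIDS | infant deaths | under-five deaths | Total expenditure | GDP | BMI | thinness  1-19 years | Alcohol | Schooling | Life expectancy |
|---|------------|-------------|---------|-------|------------|----------|---------------|-------------------|-------------------|-----|-----|----------------------|---------|-----------|-----------------|
| 0 | 33736494.0 | 65.0 | 1154 | 6.0 | 65.0 | 0.1 | 62 | 83 | 8.16 | 584.259210 | 19.1 | 17.2 | 0.01 | 10.1 | 65.0 |
| 1 | 327582.0 | 62.0 | 492 | 58.0 | 62.0 | 0.1 | 64 | 86 | 8.18 | 612.696514 | 18.6 | 17.5 | 0.01 | 10.0 | 59.9 |
| 2 | 31731688.0 | 64.0 | 430 | 62.0 | 64.0 | 0.1 | 66 | 89 | 8.13 | 631.744976 | 18.1 | 17.7 | 0.01 | 9.9 | 59.9 |
| 3 | 3696958.0 | 67.0 | 2787 | 67.0 | 67.0 | 0.1 | 69 | 93 | 8.52 | 669.959000 | 17.6 | 17.9 | 0.01 | 9.8 | 59.5 |
| 4 | 2978599.0 | 68.0 | 3013 | 68.0 | 68.0 | 0.1 | 71 | 97 | 7.87 | 63.537231 | 17.2 | 18.2 | 0.01 | 9.5 | 59.2 |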

To find out what kind of distribution each column has, I plotted them:

from matplotlib import pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(5, 3, figsize=(15, 15))

for i in range(newdf.shape[1]):
    # KDE plot of each column, to inspect its distribution and spot outliers
    sns.kdeplot(newdf.iloc[:, i], ax=ax[i // 3, i % 3], fill=True, color='red')

plt.tight_layout()
plt.show()

[Image: KDE plot of each numeric column in a 5 x 3 grid]

So what type of distribution does each column have, and what can I do to remove or replace outliers in this DataFrame when the columns have different distributions?
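One rough way to put a number on the shape of each column, alongside the plots, is the sample skewness. The sketch below is only an illustration under stated assumptions: `demo` is a made-up stand-in for `newdf`, its column names are hypothetical, and the cutoff of 1 for "strongly skewed" is a rule of thumb rather than a standard.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for newdf: one roughly symmetric and one right-skewed column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=1000),
    "right_skewed": rng.lognormal(mean=0, sigma=1, size=1000),
})

# Sample skewness per column: close to 0 for symmetric data,
# clearly positive when there is a long right tail.
skewness = demo.skew()

# Assumption: |skew| > 1 is treated as "strongly skewed" (rule of thumb only).
strongly_skewed = skewness[skewness.abs() > 1].index.tolist()

# A log transform often makes non-negative, right-skewed columns more symmetric,
# which can make outlier rules behave more sensibly afterwards.
demo_log = demo.copy()
demo_log[strongly_skewed] = np.log1p(demo[strongly_skewed])

print(skewness)
print(demo_log.skew())
```

On real data, columns such as Population or GDP would be natural candidates to check this way, but that is a guess from the five rows shown, not something the sample confirms.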

I did try the IQR method, but I guess IQR is only appropriate for a normal distribution. Every time I used it, the DataFrame ended up filled with NaN values.
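For reference, here is a minimal sketch of column-wise IQR fences on a made-up DataFrame (`demo` and its values are hypothetical). The 1.5 x IQR rule is built from quartiles rather than the mean and standard deviation, so it is not tied to a normal distribution, although on heavily skewed columns it tends to flag much of the long tail. NaN-filled results typically come from indexing the whole DataFrame with a boolean DataFrame, which keeps the shape and writes NaN wherever the mask is False. Clipping to the fences or dropping whole rows avoids that.

```python
import pandas as pd

# Hypothetical data: each column has one obvious extreme value (last row).
demo = pd.DataFrame({
    "a": [10, 11, 12, 13, 14, 15, 16, 17, 18, 500],
    "b": [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 1.0, 0.9, -40.0],
})

# Per-column quartiles and fences (1.5 * IQR is the conventional multiplier).
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean DataFrame: True where a cell lies outside its own column's fences.
is_outlier = (demo < lower) | (demo > upper)

# Indexing with a boolean DataFrame keeps the shape and puts NaN in flagged cells.
masked = demo[~is_outlier]

# Option 1: replace outliers by clipping each column to its own fences.
clipped = demo.clip(lower=lower, upper=upper, axis=1)

# Option 2: remove every row that contains at least one outlier.
dropped = demo[~is_outlier.any(axis=1)]

print(is_outlier.sum())
print(masked.isna().sum())
```

Whether to clip, drop, or leave the values alone depends on whether the extremes are errors or real observations, which the code cannot decide.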

