Yunix

Reputation: 1

Finding outliers in columns with different distributions

I am currently learning how to preprocess data, and I am working with this DataFrame:

import pandas as pd

df = pd.read_csv(...)
df.head()
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Country</th>
      <th>Year</th>
      <th>Status</th>
      <th>Population</th>
      <th>Hepatitis B</th>
      <th>Measles</th>
      <th>Polio</th>
      <th>Diphtheria</th>
      <th>HIV/AIDS</th>
      <th>infant deaths</th>
      <th>under-five deaths</th>
      <th>Total expenditure</th>
      <th>GDP</th>
      <th>BMI</th>
      <th>thinness  1-19 years</th>
      <th>Alcohol</th>
      <th>Schooling</th>
      <th>Life expectancy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Afghanistan</td>
      <td>2015</td>
      <td>Developing</td>
      <td>33736494.0</td>
      <td>65.0</td>
      <td>1154</td>
      <td>6.0</td>
      <td>65.0</td>
      <td>0.1</td>
      <td>62</td>
      <td>83</td>
      <td>8.16</td>
      <td>584.259210</td>
      <td>19.1</td>
      <td>17.2</td>
      <td>0.01</td>
      <td>10.1</td>
      <td>65.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Afghanistan</td>
      <td>2014</td>
      <td>Developing</td>
      <td>327582.0</td>
      <td>62.0</td>
      <td>492</td>
      <td>58.0</td>
      <td>62.0</td>
      <td>0.1</td>
      <td>64</td>
      <td>86</td>
      <td>8.18</td>
      <td>612.696514</td>
      <td>18.6</td>
      <td>17.5</td>
      <td>0.01</td>
      <td>10.0</td>
      <td>59.9</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Afghanistan</td>
      <td>2013</td>
      <td>Developing</td>
      <td>31731688.0</td>
      <td>64.0</td>
      <td>430</td>
      <td>62.0</td>
      <td>64.0</td>
      <td>0.1</td>
      <td>66</td>
      <td>89</td>
      <td>8.13</td>
      <td>631.744976</td>
      <td>18.1</td>
      <td>17.7</td>
      <td>0.01</td>
      <td>9.9</td>
      <td>59.9</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Afghanistan</td>
      <td>2012</td>
      <td>Developing</td>
      <td>3696958.0</td>
      <td>67.0</td>
      <td>2787</td>
      <td>67.0</td>
      <td>67.0</td>
      <td>0.1</td>
      <td>69</td>
      <td>93</td>
      <td>8.52</td>
      <td>669.959000</td>
      <td>17.6</td>
      <td>17.9</td>
      <td>0.01</td>
      <td>9.8</td>
      <td>59.5</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Afghanistan</td>
      <td>2011</td>
      <td>Developing</td>
      <td>2978599.0</td>
      <td>68.0</td>
      <td>3013</td>
      <td>68.0</td>
      <td>68.0</td>
      <td>0.1</td>
      <td>71</td>
      <td>97</td>
      <td>7.87</td>
      <td>63.537231</td>
      <td>17.2</td>
      <td>18.2</td>
      <td>0.01</td>
      <td>9.5</td>
      <td>59.2</td>
    </tr>
  </tbody>
</table>
</div>

Then I kept only the numeric columns (dropping the first one, Year):

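# dtypes treated as numeric
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

# keep numeric columns only; iloc[:, 1:] skips the first of them (Year)
newdf = df.select_dtypes(include=numerics).iloc[:, 1:]
newdf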
<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Population</th>
      <th>Hepatitis B</th>
      <th>Measles</th>
      <th>Polio</th>
      <th>Diphtheria</th>
      <th>HIV/AIDS</th>
      <th>infant deaths</th>
      <th>under-five deaths</th>
      <th>Total expenditure</th>
      <th>GDP</th>
      <th>BMI</th>
      <th>thinness  1-19 years</th>
      <th>Alcohol</th>
      <th>Schooling</th>
      <th>Life expectancy</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>33736494.0</td>
      <td>65.0</td>
      <td>1154</td>
      <td>6.0</td>
      <td>65.0</td>
      <td>0.1</td>
      <td>62</td>
      <td>83</td>
      <td>8.16</td>
      <td>584.259210</td>
      <td>19.1</td>
      <td>17.2</td>
      <td>0.01</td>
      <td>10.1</td>
      <td>65.0</td>
    </tr>
    <tr>
      <th>1</th>
      <td>327582.0</td>
      <td>62.0</td>
      <td>492</td>
      <td>58.0</td>
      <td>62.0</td>
      <td>0.1</td>
      <td>64</td>
      <td>86</td>
      <td>8.18</td>
      <td>612.696514</td>
      <td>18.6</td>
      <td>17.5</td>
      <td>0.01</td>
      <td>10.0</td>
      <td>59.9</td>
    </tr>
    <tr>
      <th>2</th>
      <td>31731688.0</td>
      <td>64.0</td>
      <td>430</td>
      <td>62.0</td>
      <td>64.0</td>
      <td>0.1</td>
      <td>66</td>
      <td>89</td>
      <td>8.13</td>
      <td>631.744976</td>
      <td>18.1</td>
      <td>17.7</td>
      <td>0.01</td>
      <td>9.9</td>
      <td>59.9</td>
    </tr>
    <tr>
      <th>3</th>
      <td>3696958.0</td>
      <td>67.0</td>
      <td>2787</td>
      <td>67.0</td>
      <td>67.0</td>
      <td>0.1</td>
      <td>69</td>
      <td>93</td>
      <td>8.52</td>
      <td>669.959000</td>
      <td>17.6</td>
      <td>17.9</td>
      <td>0.01</td>
      <td>9.8</td>
      <td>59.5</td>
    </tr>
    <tr>
      <th>4</th>
      <td>2978599.0</td>
      <td>68.0</td>
      <td>3013</td>
      <td>68.0</td>
      <td>68.0</td>
      <td>0.1</td>
      <td>71</td>
      <td>97</td>
      <td>7.87</td>
      <td>63.537231</td>
      <td>17.2</td>
      <td>18.2</td>
      <td>0.01</td>
      <td>9.5</td>
      <td>59.2</td>
    </tr>
  </tbody>
</table>
</div>

To find out what kind of distribution each column has, I plotted them:

from matplotlib import pyplot as plt
import seaborn as sns

# one panel per numeric column: 15 columns on a 5 x 3 grid
fig, ax = plt.subplots(5, 3, figsize=(15, 15))

for i in range(newdf.shape[1]):
    # KDE plot of each column, to see its shape and spot outliers
    sns.kdeplot(newdf.iloc[:, i], ax=ax[i // 3, i % 3], fill=True, color='red')

plt.tight_layout()
plt.show()

[Image: KDE plot of each numeric column]

So what type of distribution does each column have, and how can I remove or replace the outliers in this DataFrame when the columns follow different distributions?
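
To put a number on the shape of each column rather than judging it by eye, sample skewness is one option. The snippet below is only a sketch on made-up data: `demo` and its two columns are hypothetical stand-ins for `newdf`, and the log transform is just one common choice for non-negative, right-skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for newdf: one roughly symmetric column
# and one right-skewed column (not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=5, size=1000),
    "right_skewed": rng.lognormal(mean=0, sigma=1, size=1000),
})

# Sample skewness per column: values near 0 suggest a symmetric shape,
# large positive values suggest a long right tail.
skew = demo.skew()

# log1p is one way to pull in the right tail of a non-negative column.
log_skew = np.log1p(demo["right_skewed"]).skew()

print(skew)
print(log_skew)
```

On the real data the same call would be `newdf.skew()`; how large a skewness counts as "too skewed" is a judgment call, not a fixed rule.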

I tried the IQR rule, but I suspect it is only appropriate for normally distributed data. Also, every time I applied it, the DataFrame ended up filled with NaN values.
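
The IQR code itself is not shown here, so the following is a guess at what such an attempt might look like, on hypothetical toy data (`demo`, with two planted outliers). It illustrates one known way to end up with NaN: indexing a DataFrame with a boolean DataFrame keeps the shape and puts NaN in every cell where the mask is False. Two alternatives are sketched next to it, row filtering and clipping.

```python
import pandas as pd

# Hypothetical toy data standing in for newdf;
# 1000.0 and 500.0 are planted outliers.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 1000.0],
    "b": [10.0, 11.0, 12.0, 13.0, 500.0],
})

# Per-column quartiles and the usual 1.5 * IQR fences.
q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cell-wise mask: True where a value lies inside its column's fences.
inside = (demo >= lower) & (demo <= upper)

# Boolean-DataFrame indexing keeps the shape and writes NaN
# into every cell where the mask is False.
masked = demo[inside]

# Option 1: drop every row with at least one value outside the fences.
filtered = demo[inside.all(axis=1)]

# Option 2: keep all rows and cap values at the fences instead.
clipped = demo.clip(lower=lower, upper=upper, axis=1)

print(masked)
print(filtered)
print(clipped)
```

The IQR rule does not assume normality, but on heavily skewed columns it tends to flag much of the long tail, so the fences (or a transform applied first) may need to be chosen per column.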

Upvotes: 0

Views: 44

Answers (0)
