I am currently learning how to preprocess data, and I am working with this DataFrame:
import pandas as pd
df = pd.read_csv(...)
df.head()
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Country</th>
<th>Year</th>
<th>Status</th>
<th>Population</th>
<th>Hepatitis B</th>
<th>Measles</th>
<th>Polio</th>
<th>Diphtheria</th>
<th>HIV/AIDS</th>
<th>infant deaths</th>
<th>under-five deaths</th>
<th>Total expenditure</th>
<th>GDP</th>
<th>BMI</th>
<th>thinness 1-19 years</th>
<th>Alcohol</th>
<th>Schooling</th>
<th>Life expectancy</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>Afghanistan</td>
<td>2015</td>
<td>Developing</td>
<td>33736494.0</td>
<td>65.0</td>
<td>1154</td>
<td>6.0</td>
<td>65.0</td>
<td>0.1</td>
<td>62</td>
<td>83</td>
<td>8.16</td>
<td>584.259210</td>
<td>19.1</td>
<td>17.2</td>
<td>0.01</td>
<td>10.1</td>
<td>65.0</td>
</tr>
<tr>
<th>1</th>
<td>Afghanistan</td>
<td>2014</td>
<td>Developing</td>
<td>327582.0</td>
<td>62.0</td>
<td>492</td>
<td>58.0</td>
<td>62.0</td>
<td>0.1</td>
<td>64</td>
<td>86</td>
<td>8.18</td>
<td>612.696514</td>
<td>18.6</td>
<td>17.5</td>
<td>0.01</td>
<td>10.0</td>
<td>59.9</td>
</tr>
<tr>
<th>2</th>
<td>Afghanistan</td>
<td>2013</td>
<td>Developing</td>
<td>31731688.0</td>
<td>64.0</td>
<td>430</td>
<td>62.0</td>
<td>64.0</td>
<td>0.1</td>
<td>66</td>
<td>89</td>
<td>8.13</td>
<td>631.744976</td>
<td>18.1</td>
<td>17.7</td>
<td>0.01</td>
<td>9.9</td>
<td>59.9</td>
</tr>
<tr>
<th>3</th>
<td>Afghanistan</td>
<td>2012</td>
<td>Developing</td>
<td>3696958.0</td>
<td>67.0</td>
<td>2787</td>
<td>67.0</td>
<td>67.0</td>
<td>0.1</td>
<td>69</td>
<td>93</td>
<td>8.52</td>
<td>669.959000</td>
<td>17.6</td>
<td>17.9</td>
<td>0.01</td>
<td>9.8</td>
<td>59.5</td>
</tr>
<tr>
<th>4</th>
<td>Afghanistan</td>
<td>2011</td>
<td>Developing</td>
<td>2978599.0</td>
<td>68.0</td>
<td>3013</td>
<td>68.0</td>
<td>68.0</td>
<td>0.1</td>
<td>71</td>
<td>97</td>
<td>7.87</td>
<td>63.537231</td>
<td>17.2</td>
<td>18.2</td>
<td>0.01</td>
<td>9.5</td>
<td>59.2</td>
</tr>
</tbody>
</table>
</div>
Then I kept only the numeric columns (the `.iloc[:, 1:]` also drops `Year`):
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics).iloc[:,1:]
newdf
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Population</th>
<th>Hepatitis B</th>
<th>Measles</th>
<th>Polio</th>
<th>Diphtheria</th>
<th>HIV/AIDS</th>
<th>infant deaths</th>
<th>under-five deaths</th>
<th>Total expenditure</th>
<th>GDP</th>
<th>BMI</th>
<th>thinness 1-19 years</th>
<th>Alcohol</th>
<th>Schooling</th>
<th>Life expectancy</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>33736494.0</td>
<td>65.0</td>
<td>1154</td>
<td>6.0</td>
<td>65.0</td>
<td>0.1</td>
<td>62</td>
<td>83</td>
<td>8.16</td>
<td>584.259210</td>
<td>19.1</td>
<td>17.2</td>
<td>0.01</td>
<td>10.1</td>
<td>65.0</td>
</tr>
<tr>
<th>1</th>
<td>327582.0</td>
<td>62.0</td>
<td>492</td>
<td>58.0</td>
<td>62.0</td>
<td>0.1</td>
<td>64</td>
<td>86</td>
<td>8.18</td>
<td>612.696514</td>
<td>18.6</td>
<td>17.5</td>
<td>0.01</td>
<td>10.0</td>
<td>59.9</td>
</tr>
<tr>
<th>2</th>
<td>31731688.0</td>
<td>64.0</td>
<td>430</td>
<td>62.0</td>
<td>64.0</td>
<td>0.1</td>
<td>66</td>
<td>89</td>
<td>8.13</td>
<td>631.744976</td>
<td>18.1</td>
<td>17.7</td>
<td>0.01</td>
<td>9.9</td>
<td>59.9</td>
</tr>
<tr>
<th>3</th>
<td>3696958.0</td>
<td>67.0</td>
<td>2787</td>
<td>67.0</td>
<td>67.0</td>
<td>0.1</td>
<td>69</td>
<td>93</td>
<td>8.52</td>
<td>669.959000</td>
<td>17.6</td>
<td>17.9</td>
<td>0.01</td>
<td>9.8</td>
<td>59.5</td>
</tr>
<tr>
<th>4</th>
<td>2978599.0</td>
<td>68.0</td>
<td>3013</td>
<td>68.0</td>
<td>68.0</td>
<td>0.1</td>
<td>71</td>
<td>97</td>
<td>7.87</td>
<td>63.537231</td>
<td>17.2</td>
<td>18.2</td>
<td>0.01</td>
<td>9.5</td>
<td>59.2</td>
</tr>
</tbody>
</table>
</div>
To find out what kind of distribution each column has, I plotted them:
from matplotlib import pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(5, 3, figsize=(15, 15))
for i in range(newdf.shape[1]):
    # KDE plot of each column, to look at its shape and spot outliers
    sns.kdeplot(newdf.iloc[:, i], ax=ax[i // 3, i % 3], fill=True, color='red')
plt.tight_layout()
plt.show()
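As a numeric complement to the plots, per-column skewness can give a rough hint about the shape of each distribution. This is only a sketch on made-up data: the `toy` frame and its column names are hypothetical stand-ins for `newdf`, and skewness alone does not identify a distribution family.

```python
import pandas as pd

# Hypothetical stand-in for newdf; the values are made up for illustration.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_skewed": [1.0, 1.0, 2.0, 2.0, 50.0],
})

# Sample skewness per column: values near 0 suggest a roughly symmetric
# shape, large positive values suggest a long right tail.
skew = toy.skew()
print(skew)
```

On the real data this would be `newdf.skew()`. Columns with a strong right tail are often easier to inspect after a log transform, though whether that is appropriate depends on the column.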
So what type of distribution does each column have, and how can I remove or replace outliers in a DataFrame whose columns have different distributions?
I tried the IQR rule, but I suspect it is only appropriate for normally distributed data, and every time I used it the DataFrame ended up filled with NaN values.
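For reference, here is a minimal sketch of the 1.5 * IQR rule applied per column on made-up data. The IQR code that produced the NaN values is not shown above, so this assumes one common cause: indexing a DataFrame with a DataFrame-shaped boolean mask, which keeps the shape and puts NaN wherever the mask is False. The `toy` frame and all variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: column "a" has one obvious outlier, "b" has none.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})

# Per-column quartiles and the usual 1.5 * IQR fences.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean DataFrame: True where a value lies inside its own column's fences.
inside = (toy >= lower) & (toy <= upper)

# Indexing with a DataFrame-shaped mask keeps the shape and writes NaN
# wherever the mask is False (assumed source of the NaN values).
masked = toy[inside]

# Option 1: drop every row that has an outlier in any column.
filtered = toy[inside.all(axis=1)]

# Option 2: replace outliers by capping them at the fences.
clipped = toy.clip(lower=lower, upper=upper, axis=1)

print(masked)
print(filtered)
print(clipped)
```

Here `masked` has a NaN only where the outlier was, `filtered` loses that whole row, and `clipped` keeps every row but caps the value. With many columns, `inside.all(axis=1)` can remove a large share of the rows, so capping or a per-column treatment may be preferable.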