Lorenzo

Reputation: 131

How to deal with missing values in Pandas DataFrame?

I have a Pandas DataFrame that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.

As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.

I tried to fill with NaN but if I sum a certain column, for example, the result is NaN. I also tried to fill with None but I get an error because I'm summing different datatypes.

Can somebody help? Thank you in advance.

Upvotes: 0

Views: 4256

Answers (5)

MD. SHIFULLAH

Reputation: 1721

I'm providing code with input and output data:

Input:

Original DataFrame:
     A    B   C
0  1.0  5.0  10
1  2.0  NaN  11
2  NaN  NaN  12
3  4.0  8.0  13

Code:

import pandas as pd
import numpy as np
 
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

"""
    Im filling here None value with Zero(0)
"""
df_filled = df.fillna(0)
print("DataFrame after filling missing values:")
print(df_filled)

Output:

DataFrame after filling missing values:
     A    B   C
0  1.0  5.0  10
1  2.0  0.0  11
2  0.0  0.0  12
3  4.0  8.0  13

Upvotes: 0

Ashu007

Reputation: 795

If you want to convert a specific column to a numeric type, with the missing values filled with NaN, you can use the line below. It converts every value in the column to a numeric type and automatically replaces anything non-numeric with NaN, which won't affect your statistical operations.

df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

If you want to do the same for all the columns in dataframe you can use:

for i in df.columns:
    df[i] = pd.to_numeric(df[i], errors='coerce')
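A minimal sketch of how the coercion behaves, using a hypothetical `score` column mixing numeric strings with a non-numeric entry:

```python
import pandas as pd

# Hypothetical column: 'twenty' cannot be parsed as a number
df = pd.DataFrame({'score': ['10', 'twenty', '30']})
df['score'] = pd.to_numeric(df['score'], errors='coerce')

# 'twenty' became NaN, and pandas skips NaN in statistics:
print(df['score'].mean())  # 20.0
```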

Upvotes: 0

Saumya Mehta

Reputation: 191

You can use df.fillna(). Here is an example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)

Generally, filling with a value like 0 will distort the statistics you compute on the data. Filling with the mean instead leaves the mean unchanged (though it still shrinks the variance), so use df.fillna(df.mean()) instead.
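A quick sketch on toy data showing that mean-filling preserves the column mean:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})
print(df['A'].mean())        # 2.0 -- NaN is skipped when computing the mean

filled = df.fillna(df.mean())
print(filled['A'].tolist())  # [1.0, 2.0, 3.0]
print(filled['A'].mean())    # still 2.0
```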

Upvotes: 0

Josh Friedlander

Reputation: 11657

The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number that we can fill a df with that will cause Pandas to just overlook it. Depending on our needs, we will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), or sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
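For instance, pandas reductions skip NaN by default (skipna=True), which already gives the Excel-like behaviour described in the question:

```python
import pandas as pd
import numpy as np

s = pd.Series([5.0, np.nan])
print(s.mean())             # 5.0 -- matches Excel averaging {5, empty cell}
print(s.sum())              # 5.0
print(s.sum(skipna=False))  # nan -- only if you disable NaN skipping
```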

Upvotes: 0

Philipe Riskalla Leal

Reputation: 1066

There are many answers to your two questions.

Here is a solution for your first one:

If you wish to insert a value into the NaN entries of the DataFrame that won't alter your statistics, I would suggest using the mean of that data.

Example:

df # your dataframe with NaN values

df.fillna(df.mean(), inplace=True)

For the second question:

If you need descriptive statistics from your DataFrame that are not influenced by the NaN values, here are two solutions:

Option 1:

df # your dataframe with NaN values

df.fillna(df.mean(), inplace=True)

df.mean()
df.std()

# or even:

df.describe()

Option 2:

I would suggest using the NumPy NaN-aware functions such as numpy.nansum, numpy.nanmean, and numpy.nanstd:

df.apply(numpy.nansum)

df.apply(numpy.nanstd) #...
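A short sketch of option 2 on toy data (the column name `A` is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})

# Apply NaN-aware reductions column by column
sums = df.apply(np.nansum)
means = df.apply(np.nanmean)
print(sums['A'])   # 4.0
print(means['A'])  # 2.0
```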

Upvotes: 4
