Reputation: 131
I have a Pandas Dataframe that has some missing values. I would like to fill the missing values with something that doesn't influence the statistics that I will do on the data.
As an example, if in Excel you try to average a cell that contains 5 and an empty cell, the average will be 5. I'd like to have the same in Python.
I tried to fill the missing cells with NaN, but then if I sum a certain column, for example, the result is NaN.
I also tried to fill with None but I get an error because I'm summing different datatypes.
Can somebody help? Thank you in advance.
Upvotes: 0
Views: 4256
Reputation: 1721
I'm providing code with input and output data:
Input:
Original DataFrame:
A B C
0 1.0 5.0 10
1 2.0 NaN 11
2 NaN NaN 12
3 4.0 8.0 13
Code:
import pandas as pd
import numpy as np
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 11, 12, 13]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
"""
Im filling here None value with Zero(0)
"""
df_filled = df.fillna(0)
print("DataFrame after filling missing values:")
print(df_filled)
Output:
DataFrame after filling missing values:
A B C
0 1.0 5.0 10
1 2.0 0.0 11
2 0.0 0.0 12
3 4.0 8.0 13
Upvotes: 0
Reputation: 795
If you want to change the datatype of a specific column and have its missing values filled with NaN for statistical operations, you can use the line of code below. It converts all values of that column to a numeric type, automatically replaces anything unparseable with NaN, and will not affect your statistical operations.
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')
If you want to do the same for all the columns in dataframe you can use:
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
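As a quick sketch of what errors='coerce' does (the column name and values here are made up for illustration): any entry that cannot be parsed as a number becomes NaN, and pandas' statistics then skip it.

```python
import pandas as pd

# Hypothetical column mixing numeric strings and an unparseable entry
df = pd.DataFrame({'column_name': ['1', '2', 'missing', '4']})

# 'missing' cannot be parsed, so it is coerced to NaN
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

print(df['column_name'].tolist())  # [1.0, 2.0, nan, 4.0]
print(df['column_name'].mean())    # NaN is skipped: (1 + 2 + 4) / 3
```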
Upvotes: 0
Reputation: 191
You can use df.fillna(). Here is an example of how you can do the same.
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, 1, np.nan],
                   [2, np.nan, 3, 4],
                   [4, np.nan, np.nan, 3],
                   [np.nan, 2, 1, np.nan]], columns=list('ABCD'))
df.fillna(0.0)
Generally, filling with a value like 0 will affect the statistics you do on your data. So go for the mean of the data, which makes sure the column means are unchanged. So, use df.fillna(df.mean()) instead.
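A quick check of that claim, on toy data assumed just for illustration: filling with each column's mean leaves the column means unchanged (though note it does shrink the standard deviation, since the filled values sit exactly at the mean).

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [np.nan, 2, 4, np.nan],
                   'B': [2, np.nan, np.nan, 2]})

before = df.mean()             # NaN values are skipped: A -> 3.0, B -> 2.0
filled = df.fillna(df.mean())  # each column filled with its own mean
after = filled.mean()

print(before.equals(after))    # True: the means are preserved
```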
Upvotes: 0
Reputation: 11657
The answer to your question is that missing values work differently in Pandas than in Excel. You can read about the technical reasons for that here. Basically, there is no magic number you can fill a df with that will cause Pandas to just overlook it. Depending on your needs, you will sometimes choose to fill the missing values, sometimes to drop them (either permanently or for the duration of a calculation), and sometimes to use methods that can work with them (e.g. numpy.nansum, as Philipe Riskalla Leal mentioned).
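It is worth noting that pandas' own reductions already skip NaN by default (skipna=True), which gives exactly the Excel-like behaviour described in the question; a minimal sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([5, np.nan])

print(s.sum())              # 5.0 -- NaN is skipped by default
print(s.mean())             # 5.0 -- matches the Excel behaviour asked about
print(s.sum(skipna=False))  # nan -- only if you explicitly opt out
```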
Upvotes: 0
Reputation: 1066
There are many answers for your two questions.
Here is a solution for your first one:
If you wish to fill the NaN entries in your DataFrame with a value that won't alter your statistics, then I would suggest using the mean of that data.
Example:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
For the second question:
If you need descriptive statistics from your DataFrame, and those statistics should not be influenced by the NaN values, here are two solutions.
Option 1:
df # your dataframe with NaN values
df.fillna(df.mean(), inplace=True)
df.mean()
df.std()
# or even:
df.describe()
Option 2:
I would suggest you use the NumPy NaN-aware functions, such as numpy.nansum, numpy.nanmean and numpy.nanstd:
import numpy as np

df.apply(np.nansum)
df.apply(np.nanstd)  # etc.
Upvotes: 4