ghylander

Reputation: 156

pandas dropna giving different results when applied to a dataframe with 2 columns or to the columns as independent dataframes

I'm using the following public dataset to practice linear regression:

https://www.kaggle.com/theforcecoder/wind-power-forecasting

I tried to do a least-squares regression using numpy.polynomial, and I ran into issues because the columns had NaN values.

Applying dropna to the dataframe from which I extract the columns has no effect. I tried both using inplace=True and assigning the result to a new dataframe, but neither works:

LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']]
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]

WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']

print(len(WindSpeed))
print(len(BearingShaftTemperature))

and

LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']].dropna()
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]

WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']

print(len(WindSpeed))
print(len(BearingShaftTemperature))

Both produce the same output: each column has length 323.

However, applying dropna to the columns themselves does drop rows:

WindSpeed = BearingTempsCorr['WindSpeed'].dropna()
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature'].dropna()

This results in lengths of (316, 312).

However, this introduces a new problem: the regression cannot be applied because x and y have different lengths.

What is going on here?

Upvotes: 1

Views: 151

Answers (1)

Leonid Mednikov

Reputation: 973

There is an error in your code:

WindSpeed = BearingTempsCorr['WindSpeed'] 
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']

You read the columns from BearingTempsCorr, but you should read them from LSFitdDf, where you stored the result of dropna:

WindSpeed = LSFitdDf['WindSpeed'] 
BearingShaftTemperature = LSFitdDf['BearingShaftTemperature']
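To see the difference on something small, here is a minimal sketch. The frame df and its values are made up for illustration (they are not taken from the Kaggle dataset); it only stands in for BearingTempsCorr. It shows that dropna returns a new dataframe and leaves the original untouched, and that dropping NaN per column removes different rows from each column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for BearingTempsCorr; the values are invented for illustration.
df = pd.DataFrame({
    "WindSpeed": [3.1, np.nan, 5.2, 6.0, 7.4, 8.1],
    "BearingShaftTemperature": [40.0, 41.5, np.nan, np.nan, 45.0, 46.3],
})

# dropna() returns a new frame; df itself still contains the NaN rows.
clean = df[["WindSpeed", "BearingShaftTemperature"]].dropna()

# Reading from the original frame: nothing was dropped.
print(len(df["WindSpeed"]), len(df["BearingShaftTemperature"]))        # 6 6

# Reading from the cleaned frame: a row is gone if either column was NaN,
# so both columns stay aligned and have the same length.
print(len(clean["WindSpeed"]), len(clean["BearingShaftTemperature"]))  # 3 3

# Dropping NaN per column removes different rows from each column,
# so the lengths no longer match.
ws = df["WindSpeed"].dropna()
bt = df["BearingShaftTemperature"].dropna()
print(len(ws), len(bt))                                                # 5 4
```

This is the same pattern as in your output: 323 and 323 from the untouched frame, 316 and 312 from the per-column dropna.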

P.S. You also have a redundant line, which just selects the same two columns from LSFitdDf and assigns the result back to the same variable:

LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]

P.P.S. The clearest way to keep the whole dataset but drop rows with NA values in the desired columns is the following (assign the result, since dropna returns a new dataframe):

BearingTempsCorr.dropna(subset=['WindSpeed', 'BearingShaftTemperature'])
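Putting it together with the regression, a sketch might look like the one below. The data and the extra column "Other" are hypothetical, and I am assuming a degree-1 fit with numpy.polynomial.Polynomial.fit, since the question does not show the fitting code.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import Polynomial

# Toy data lying on y = 2x + 1, with NaNs scattered in different rows.
# "Other" is a hypothetical extra column that is not part of the fit.
df = pd.DataFrame({
    "WindSpeed": [1.0, 2.0, np.nan, 4.0, 5.0],
    "BearingShaftTemperature": [3.0, 5.0, 7.0, np.nan, 11.0],
    "Other": [np.nan, 1.0, 2.0, 3.0, 4.0],
})

# Drop rows only where one of the two regression columns is NaN.
# A NaN in "Other" does not cause a row to be dropped.
clean = df.dropna(subset=["WindSpeed", "BearingShaftTemperature"])

# Both arrays come from the same cleaned frame, so they have equal length.
x = clean["WindSpeed"].to_numpy()
y = clean["BearingShaftTemperature"].to_numpy()

# Least-squares fit of a straight line; convert() maps the coefficients
# back to the original (unscaled) x, ordered from lowest degree up.
fit = Polynomial.fit(x, y, deg=1)
intercept, slope = fit.convert().coef
print(intercept, slope)
```

Because x and y are taken from the same cleaned frame, the length mismatch from the per-column dropna cannot occur.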

Upvotes: 1
