Reputation: 156
I'm using the following public dataset to practice linear regression:
https://www.kaggle.com/theforcecoder/wind-power-forecasting
I tried to fit a least-squares regression using numpy's polynomial module and ran into issues because the columns contain NaN values.
Applying dropna to the DataFrame from which I extract the columns has no effect. I tried both using inplace=True and assigning the result to a new DataFrame, but neither works:
LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']]
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
print(len(WindSpeed))
print(len(BearingShaftTemperature))
and
LSFitdDf = BearingTempsCorr[['WindSpeed', 'BearingShaftTemperature']].dropna()
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
print(len(WindSpeed))
print(len(BearingShaftTemperature))
Both produce the same output (both columns have length 323).
However, applying dropna to the columns themselves does drop rows:
WindSpeed = BearingTempsCorr['WindSpeed'].dropna()
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature'].dropna()
This results in lengths of (316, 312).
However, this introduces a new problem: the regression cannot be applied because x and y have different lengths.
What is going on here?
Upvotes: 1
Views: 151
Reputation: 973
There is an error in your code:
WindSpeed = BearingTempsCorr['WindSpeed']
BearingShaftTemperature = BearingTempsCorr['BearingShaftTemperature']
You use BearingTempsCorr, but you should use LSFitdDf (where you saved the result of dropna).
WindSpeed = LSFitdDf['WindSpeed']
BearingShaftTemperature = LSFitdDf['BearingShaftTemperature']
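To illustrate the fix end to end, here is a minimal sketch on a small made-up frame. The data values are invented stand-ins for the real dataset, and the use of numpy.polynomial.Polynomial.fit is an assumption about which numpy polynomial routine you are calling. It shows that columns taken from the cleaned frame have matching lengths and can be fitted directly.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import Polynomial

# Hypothetical stand-in for BearingTempsCorr; the values are made up.
BearingTempsCorr = pd.DataFrame({
    "WindSpeed": [3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
    "BearingShaftTemperature": [40.0, 42.0, 44.0, np.nan, 48.0, 50.0],
})

# Drop every row with a NaN in either column, so x and y stay aligned.
LSFitdDf = BearingTempsCorr[["WindSpeed", "BearingShaftTemperature"]].dropna()

# Take the columns from the cleaned frame, not from the original one.
WindSpeed = LSFitdDf["WindSpeed"].to_numpy()
BearingShaftTemperature = LSFitdDf["BearingShaftTemperature"].to_numpy()

# Degree-1 least-squares fit; convert() returns coefficients in the
# original (unscaled) x domain, ordered from lowest degree to highest.
fit = Polynomial.fit(WindSpeed, BearingShaftTemperature, deg=1).convert()
intercept, slope = fit.coef

print(len(WindSpeed), len(BearingShaftTemperature))
```

Both lengths are equal here because rows are removed from the frame as a whole, rather than from each column independently.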
P.S. You also have a redundant line, which just copies LSFitdDf into the same variable:
LSFitdDf = LSFitdDf[['WindSpeed', 'BearingShaftTemperature']]
P.P.S. The clearest way to keep the whole dataset but drop rows with NA values in the desired columns is
BearingTempsCorr.dropna(subset=['WindSpeed', 'BearingShaftTemperature'])
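A small sketch of how subset behaves, again on invented data (the frame df and the ActivePower column are hypothetical, used only for illustration). Note that dropna returns a new DataFrame unless inplace=True is passed, so the result has to be assigned. The last line also shows why calling dropna on each column separately gave you different lengths.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; values and the extra column are made up.
df = pd.DataFrame({
    "WindSpeed": [3.0, np.nan, 5.0, 6.0, 7.0],
    "BearingShaftTemperature": [40.0, 41.0, np.nan, np.nan, 44.0],
    "ActivePower": [np.nan, 110.0, 120.0, 130.0, 140.0],
})

# Only NaNs in the listed columns cause a row to be dropped;
# a NaN in ActivePower alone does not. The result must be assigned.
cleaned = df.dropna(subset=["WindSpeed", "BearingShaftTemperature"])

# Dropping per column removes a different set of rows from each Series,
# so the two lengths generally differ.
separate_lengths = (
    len(df["WindSpeed"].dropna()),
    len(df["BearingShaftTemperature"].dropna()),
)

print(len(cleaned), separate_lengths)
```

The original df is left unchanged, and cleaned keeps all three columns for the rows that survive.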
Upvotes: 1