Reputation: 13
(Beginner programmer please be nice!) I am trying to run a linear regression on data from CSV files for certain years, but some of the files are missing data for one or more years. The sklearn linear regression function I'm using seems to convert these NaN values to 0 automatically, which throws off the regression results for that particular CSV file. Here is what I currently have in my loop:
munilist = ["Adjuntas", "Anasco", "Ciales", "Jayuya", "Lares", "LasMarias", "Maricao", "Mayaguez", "Orocovis", "Penuelas", "Ponce", "SabanaGrande", "SanGerman", "SanSebastian", "Utuado", "Yauco"]
for municipality in munilist:
    x = np.array([1987, 1992, 1998, 2002, 2007, 2012])
    x = x.reshape(6,1)
    y = np.array(df[df["Municipio"]==municipality].iloc[0, 1:7]).reshape(6,1)
    mask = x[~pd.isna(x)] & y[~pd.isna(y)]
    xlin = np.arange(1987, 2013,1)  # range of years to plot
    reg = LinearRegression(fit_intercept=True).fit(x[mask], y[mask])
    a0 = reg.intercept_
    a1 = reg.coef_[0]
I'm not even sure I built the mask correctly, but I keep getting this error when I try to use it: "arrays used as indices must be of integer (or boolean) type"
Upvotes: 0
Views: 149
Reputation: 54897
The problem here is subtle:
mask = x[~pd.isna(x)] & y[~pd.isna(y)]
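pd.isna already returns a boolean mask array. So x[~pd.isna(x)] actually returns the elements of x that meet the criterion, not the boolean mask itself. Combining those element arrays with & therefore does not produce a boolean mask, and using the result as an index fails. The fix is to combine the masks themselves: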
mask = ~pd.isna(x) & ~pd.isna(y)
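As a rough sketch of how the combined mask behaves, the snippet below uses made-up y values (not the asker's data) and a NumPy-only np.polyfit fit as a stand-in for LinearRegression. One thing to watch for: indexing with a 2-D boolean mask returns a flattened 1-D array, so x would likely need to be reshaped back into a column before being passed to sklearn's fit.

```python
import numpy as np
import pandas as pd

# Hypothetical data: six census years, two of them with missing observations.
x = np.array([1987, 1992, 1998, 2002, 2007, 2012]).reshape(6, 1)
y = np.array([10.0, np.nan, 14.4, 16.0, np.nan, 20.0]).reshape(6, 1)

# Combine the boolean masks themselves, not the filtered values.
mask = ~pd.isna(x) & ~pd.isna(y)      # shape (6, 1), dtype bool

# Boolean indexing with a 2-D mask flattens the result to 1-D,
# so restore the column shape that scikit-learn expects for X.
x_clean = x[mask].reshape(-1, 1)      # shape (4, 1)
y_clean = y[mask]                     # shape (4,)

# NumPy-only least-squares line, used here in place of LinearRegression.
slope, intercept = np.polyfit(x_clean.ravel(), y_clean, deg=1)
```

With sklearn, the equivalent call would presumably be LinearRegression().fit(x_clean, y_clean), which only sees the rows where both x and y are present.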
Upvotes: 1