Francis Russell

Reputation: 13

Removing NaN values from np array for linear regression

(Beginner programmer, please be nice!) I am trying to run a linear regression on CSV files for certain years, but some of the files are missing data for one or more years. The linear regression function I'm using from sklearn seems to automatically convert these NaN values to 0, which distorts the regression results for that particular CSV file. Here is what I currently have in my loop:

    munilist = ["Adjuntas", "Anasco", "Ciales", "Jayuya", "Lares", "LasMarias", "Maricao", "Mayaguez", "Orocovis", "Penuelas", "Ponce", "SabanaGrande", "SanGerman", "SanSebastian", "Utuado", "Yauco"]
    for municipality in munilist:

        x = np.array([1987, 1992, 1998, 2002, 2007, 2012])
        x = x.reshape(6,1)
        y = np.array(df[df["Municipio"]==municipality].iloc[0, 1:7]).reshape(6,1)
        mask = x[~pd.isna(x)] & y[~pd.isna(y)]
        xlin = np.arange(1987, 2013,1) #range of years to plot
        reg = LinearRegression(fit_intercept=True).fit(x[mask], y[mask])
        a0 = reg.intercept_
        a1 = reg.coef_[0]

I'm not even sure I built the mask correctly, but I keep getting this error when I try to use it: "arrays used as indices must be of integer (or boolean) type".

Upvotes: 0

Views: 149

Answers (1)

Tim Roberts

Reputation: 54897

The problem here is subtle:

        mask = x[~pd.isna(x)] & y[~pd.isna(y)]

pd.isna already returns a boolean mask array. So x[~pd.isna(x)] returns the elements of x that are not NaN, not the boolean mask itself, and combining two such value arrays with & does not produce something you can index with. The fix is to combine the masks themselves:

        mask = ~pd.isna(x) & ~pd.isna(y)
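For context, here is a minimal sketch of how the combined mask might be used end to end. The y values are made up for illustration (they are not from the question's CSV files), and the arrays are kept 1-D with a float dtype, which is an assumption; the row pulled from the DataFrame in the question may need something like .astype(float) first. One thing to watch: boolean indexing returns a 1-D array, so x has to be reshaped back into a column before it is passed to fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical values standing in for one municipality's row; NaN marks a missing year.
x = np.array([1987, 1992, 1998, 2002, 2007, 2012], dtype=float)
y = np.array([10.0, 12.0, np.nan, 16.0, 18.0, np.nan])

# Combine the boolean masks themselves: True only where both x and y are present.
mask = ~pd.isna(x) & ~pd.isna(y)

# Boolean indexing yields 1-D arrays, so restore the column shape sklearn expects for X.
x_valid = x[mask].reshape(-1, 1)
y_valid = y[mask]

reg = LinearRegression(fit_intercept=True).fit(x_valid, y_valid)
a0 = reg.intercept_  # intercept of the fitted line
a1 = reg.coef_[0]    # slope of the fitted line
```

With this approach the years that have no data are simply left out of the fit instead of being treated as zeros.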

Upvotes: 1
