Reputation: 35

Finding a formula for a relationship between a target and multiple predictor variables

I have a dataset named "covid" of the following shape and head:

number of instances:  19345
number of attributes:  7
  submission_date state  new_case  new_death    density   latitude   longitude
0      2020-06-01    KS       292          9  71.401302  39.011902  -98.484246
1      2020-06-01    WA       271          6  96.704458  47.751074 -120.740139
2      2020-06-01    MT         8          0   6.837955  46.879682 -110.362566
3      2020-06-01    IA       146         15  54.642103  41.878003  -93.097702
4      2020-06-01    KY       136          6        NaN  37.839333  -84.270018

Each row holds one day's COVID data for a jurisdiction (the state column), along with some information about that jurisdiction. There are 365 rows per jurisdiction (states and some territories).

How can I find a relationship between the submission_date, longitude, and latitude columns as independent variables and the new_case column as the dependent variable? I guess this would be a multiple regression, but I am new to the field and have never created a regression.

Upvotes: 0

Answers (3)

user4718221

Reputation: 606

You can fit a multiple linear regression with sklearn and read off the coefficients of the regression equation in the following way.

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Parse the dates and encode them as integer timestamps so they can be used as a feature
df['submission_date_int'] = pd.to_datetime(df['submission_date']).astype('int64')

# Keep X as a DataFrame (no .values) so the column names are still available below
X = df[['submission_date_int', 'longitude', 'latitude']]
y = df['new_case']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# One coefficient per predictor; the intercept is stored in regressor.intercept_
reg_coeff = pd.DataFrame(regressor.coef_, index=X.columns, columns=['Reg_coeff'])
reg_coeff
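To turn fitted coefficients into an explicit formula, combine them with the intercept. The sketch below is a NumPy-only illustration on synthetic data: the numbers, the predictor name `days`, and the helper `format_equation` are all made up for the example, and the fit uses `np.linalg.lstsq` rather than sklearn. With a fitted sklearn model you would presumably pass `regressor.intercept_` and `regressor.coef_` to the same helper.

```python
import numpy as np

def format_equation(intercept, coefs, names, target="new_case"):
    """Build a readable 'target = b0 + b1*x1 + ...' string (hypothetical helper)."""
    terms = [f"{intercept:.3f}"]
    for c, n in zip(coefs, names):
        sign = "+" if c >= 0 else "-"
        terms.append(f"{sign} {abs(c):.3f}*{n}")
    return f"{target} = " + " ".join(terms)

# Synthetic stand-in for the real data: 200 rows, 3 predictors.
rng = np.random.default_rng(0)
names = ["days", "longitude", "latitude"]
X = rng.uniform(-1.0, 1.0, size=(200, 3))
true_intercept, true_coefs = 5.0, np.array([2.0, -1.5, 0.5])
y = true_intercept + X @ true_coefs  # noise-free, so the fit recovers the inputs

# Ordinary least squares: prepend a column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

equation = format_equation(intercept, coefs, names)
print(equation)
```

On real data the fit will not be exact, so it is worth checking the score on the held-out split before trusting the formula.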

Upvotes: 0

user7864386

Reputation:

As a benchmark, you can run an OLS regression:

import pandas as pd
import statsmodels.api as sm

# Parse the dates and encode them as integer timestamps
df['submission_date_int'] = pd.to_datetime(df['submission_date']).astype('int64')

Y = df['new_case']
X = df[['submission_date_int', 'longitude', 'latitude']]
X = sm.add_constant(X)  # add the intercept column
model = sm.OLS(Y, X)
results = model.fit()
print(results.summary())

Or use sklearn.linear_model.LinearRegression.
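A regression needs the date as a number. One caveat: casting a datetime column straight to an integer typically yields a raw timestamp (nanoseconds since 1970 in many pandas versions), so the date coefficient comes out tiny and hard to interpret. A minimal sketch of an alternative, assuming `submission_date` holds ISO date strings as in the question's head, counts days from the earliest date instead. The column name `days_since_start` and the three sample rows are made up for illustration.

```python
import pandas as pd

# Tiny stand-in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "submission_date": ["2020-06-01", "2020-06-02", "2020-06-10"],
    "new_case": [292, 271, 8],
})

# Parse the strings, then measure each date as whole days after the first one.
dates = pd.to_datetime(df["submission_date"])
df["days_since_start"] = (dates - dates.min()).dt.days

print(df)
```

With this encoding the date coefficient reads as "change in new_case per day", which is easier to reason about than a per-nanosecond slope.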

Upvotes: 1

rudolfovic

Reputation: 3286

There are many model types and packages that you could use. I'll show an example using catboost:

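import pandas as pd
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split

# Initialize data: encode the dates as integer timestamps so they can be used as a feature
df['submission_date_feature'] = pd.to_datetime(df['submission_date']).astype('int64')
train_cols = ['submission_date_feature', 'longitude', 'latitude']
label_col = 'new_case'

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(df[train_cols], df[label_col], test_size=0.2)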

train_data = Pool(X_train, y_train)
eval_data = Pool(X_test, y_test)

# Initialize CatBoostRegressor
model = CatBoostRegressor(iterations=10,
                          learning_rate=1,
                          depth=3)
# Fit model
model.fit(train_data, eval_set=eval_data)

Upvotes: 1
