Reputation: 3619
I generate a pandas DataFrame from read_sql_query. It has three columns: "result", "speed", and "weight".
I want to use scikit-learn LinearRegression to fit result = f(speed, weight).
I haven't been able to find the correct syntax that would allow me to pass this DataFrame, or column slices of it, to LinearRegression.fit(y, X).
print(df['result'].shape)
print(df[['speed', 'weight']].shape)
(8L,)
(8, 2)
but I cannot pass that to fit
lm.fit(df['result'], df[['speed', 'weight']])
It throws a deprecation warning and a ValueError:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19.
ValueError: Found arrays with inconsistent numbers of samples: [1 8]
What is an efficient, clean way to take DataFrames of targets and features and pass them to fit operations?
This is how I generated the example:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(1, high=100, size=len(days))
data3 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'result': data,'speed': data2,'weight': data3})
df = df.set_index('test')
print(df)
Upvotes: 10
Views: 20545
Reputation: 460
There may be a better way to integrate pandas and sklearn, but one thing that could stop you from doing it the way you're doing it is the shape of y, the results column. It's 1D, but needs to be 2D.
@Valentin Calomme mentioned this, but I like this way of making it 2D better than squeeze(): just add an extra pair of brackets.
df['result'] is 1D, but df[['result']] is 2D. Same data, though.
df['result'].shape
# Out: (8,)
### 1D array
df[['result']].shape
# Out: (8, 1)
### 2D array
As for the order of arguments, it matters only if you don't use the parameter names. I make it a habit to consult the documentation and always pass arguments by name: it avoids ordering mistakes, it makes the call easier to understand when I come back to it later, and it protects me if the developers ever monkey around with the argument order, haha.
lm.fit(y=df[['result']], X=df[['speed', 'weight']])
### works just as well as
lm.fit(X=df[['speed', 'weight']], y=df[['result']])
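A minimal sketch of the keyword-argument call, reusing the question's column names with arbitrary values (the data below is an illustrative assumption, not the asker's). In the scikit-learn versions I'm aware of, LinearRegression seems to accept y either as a one-column DataFrame or as a 1D Series; what changes is the shape of the fitted coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data only: same column names as the question, arbitrary values.
rng = np.random.RandomState(1111)
df = pd.DataFrame({
    "result": rng.randint(1, 100, size=8),
    "speed": rng.randint(1, 100, size=8),
    "weight": rng.randint(1, 100, size=8),
})

X = df[["speed", "weight"]]  # 2D: (8, 2)

# Keyword arguments make the call independent of argument order.
lm_2d = LinearRegression().fit(y=df[["result"]], X=X)  # y is 2D: (8, 1)
lm_1d = LinearRegression().fit(X=X, y=df["result"])    # y is 1D: (8,)

print(lm_2d.coef_.shape)  # one row of coefficients per target column
print(lm_1d.coef_.shape)  # flat vector of coefficients
```

Both fits should describe the same model; the 2D target only adds a leading axis to coef_ and to the predictions.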
Upvotes: 4
Reputation: 618
First of all, fit() takes X, y and not y, X.
Second, it's important to remember that scikit-learn works exclusively with array-like objects. It expects X to have shape (n_samples, n_features) and y to have shape (n_samples,).
It checks these shapes when you call fit, so if your X and y don't follow these rules, it will raise an error. The good news is that in the example below X already has shape (5, 2), but y has shape (5, 1), which is different from (5,), so your program might crash.
To be safe, I'd simply transform my X and y as numpy arrays from the start.
import numpy as np
import pandas as pd
X = pd.DataFrame(np.ones((5, 2)))  # shape (5, 2)
y = pd.DataFrame(np.ones((5,)))    # shape (5, 1)
X = np.array(X)
y = np.array(y).squeeze()          # shape (5,)
To take y from shape (5, 1) to shape (5,), you can use .squeeze().
This will give you the right shapes and hopefully the program will run!
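As a rough sketch, here is the same conversion applied to a frame with the question's column names. The values are made up for illustration, and DataFrame.to_numpy() assumes pandas 0.24 or later; np.array(...) as above should work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative frame with the question's column names; values are arbitrary.
df = pd.DataFrame({
    "result": [10.0, 12.0, 15.0, 11.0, 18.0],
    "speed":  [1.0, 2.0, 3.0, 2.0, 4.0],
    "weight": [5.0, 4.0, 6.0, 5.0, 7.0],
})

X = df[["speed", "weight"]].to_numpy()   # shape (5, 2)
y = df[["result"]].to_numpy().squeeze()  # (5, 1) squeezed to (5,)

# Check the shapes before fitting.
assert X.shape == (len(df), 2)
assert y.shape == (len(df),)

lm = LinearRegression().fit(X, y)
pred = lm.predict(X)                     # one prediction per row
```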
Upvotes: 6
Reputation: 1580
Use the code below:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
predefinedFeatureList = ["speed","weight"]
target = "result"
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(1, high=100, size=len(days))
data3 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'result': data,'speed': data2,'weight': data3})
df = df.set_index('test')
print(df)
#results = df['result']
#df.drop(['result'],axis= 1,inplace = True)
lm.fit(df[predefinedFeatureList], df[target])  # fit takes (X, y, sample_weight=None)
Upvotes: 0
Reputation: 36609
You are passing the arguments in the wrong order. All scikit-learn estimators that implement fit() take X, y, not y, X as you are doing.
Try this:
lm.fit(df[['speed', 'weight']], df['result'])
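A small sketch contrasting the two orders, using arbitrary data under the question's column names (an assumption for illustration). With the arguments swapped, the 1D column is treated as X and recent scikit-learn versions should reject it with a ValueError; the exact message differs between versions, so it is not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data only: arbitrary integers under the question's column names.
rng = np.random.RandomState(1111)
df = pd.DataFrame(
    rng.randint(1, 100, size=(8, 3)),
    columns=["result", "speed", "weight"],
)

lm = LinearRegression()

# Swapped order: the 1D 'result' Series is taken as X, which must be 2D.
try:
    lm.fit(df["result"], df[["speed", "weight"]])
    swapped_failed = False
except ValueError as exc:
    swapped_failed = True
    print("swapped order rejected:", type(exc).__name__)

# Correct order: features first, target second.
lm.fit(df[["speed", "weight"]], df["result"])
print(lm.coef_, lm.intercept_)
```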
Upvotes: 12