user3556757

Reputation: 3619

efficiently passing dataframes as y and X to scikit-learn fits

I generate a pandas dataframe from read_sql_query. It has three columns: "result", "speed", and "weight".

I want to use scikit-learn's LinearRegression to fit result = f(speed, weight).

I haven't been able to find the correct syntax that would allow me to pass this dataframe, or column slices of it, to LinearRegression.fit(y, X).

print df['result'].shape
print df[['speed', 'weight']].shape
(8L,)
(8, 2)

But I cannot pass that to fit:

lm.fit(df['result'], df[['speed', 'weight']])

It raises a DeprecationWarning and a ValueError:

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. 
ValueError: Found arrays with inconsistent numbers of samples: [1 8]

What is the efficient, clean way to take dataframes of targets and features, and pass them to fit operations?

This is how I generated the example:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(1, high=100, size=len(days))
data3 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'result': data,'speed': data2,'weight': data3})
df = df.set_index('test')
print(df)

Upvotes: 10

Views: 20545

Answers (4)

Kaleb Coberly

Reputation: 460

There may be a better way to integrate pandas and sklearn, but one thing that could stop you from doing it the way you're doing it is the shape of y, the result column. It's 1D, but needs to be 2D.

@Valentin Calomme mentioned this, but I like this way of making it 2D better than squeeze(): just add an extra dimension of brackets.

df['result'] is 1D, but df[['result']] is 2D. Same data, though.

df['result'].shape
# Out: (8,)
### 1D array

df[['result']].shape
# Out: (8, 1)
### 2D array

As for the order of arguments, that matters only if you don't use the parameter names. I make it a habit to consult the documentation and always pass arguments by name. That avoids ordering mistakes, makes it clearer what the call does when I come back to it later, and covers me because I'm paranoid that developers will monkey around with the argument order haha.

lm.fit(y=df[['result']], X=df[['speed', 'weight']])

### works just as well as

lm.fit(X=df[['speed', 'weight']], y=df[['result']])
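For LinearRegression specifically, y can be either a 1D Series or a 2D single-column DataFrame. A minimal sketch of how the fitted shapes differ between the two, assuming made-up data with the question's column names (the variable names lm_1d and lm_2d are only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data using the question's column names.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(1, 100, size=(8, 3)),
                  columns=["result", "speed", "weight"])

X = df[["speed", "weight"]]                             # 2D features, shape (8, 2)

lm_1d = LinearRegression().fit(X=X, y=df["result"])     # y is a Series, shape (8,)
lm_2d = LinearRegression().fit(X=X, y=df[["result"]])   # y is a DataFrame, shape (8, 1)

# The coefficients are the same numbers, but the array shapes follow y.
coef_shape_1d = lm_1d.coef_.shape
coef_shape_2d = lm_2d.coef_.shape
pred_shape_1d = lm_1d.predict(X).shape
pred_shape_2d = lm_2d.predict(X).shape
print(coef_shape_1d, coef_shape_2d)
print(pred_shape_1d, pred_shape_2d)
```

With a 2D y the model is treated as having one target per column, so coef_ and the predictions carry an extra dimension. If downstream code expects flat arrays, passing the Series is probably the simpler choice.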

Upvotes: 4

Valentin Calomme

Reputation: 618

First of all, fit() takes X, y and not y, X.

Second, it's important to remember that Scikit-Learn works exclusively with array-like objects. It expects X to have shape (n_samples, n_features) and y to have shape (n_samples,).

It checks these shapes when you call fit, so if your X and y don't abide by these rules, it may raise an error. In the example below, the good news is that X already has shape (5, 2), but y has shape (5, 1), which is different from (5,), so your program might crash.

To be safe, I'd simply transform my X and y as numpy arrays from the start.

X = pd.DataFrame(np.ones((5, 2)))
y = pd.DataFrame(np.ones((5,)))

X = np.array(X)
y = np.array(y).squeeze()

For y to go from shape (5, 1) to shape (5,), you need to use .squeeze(). This will give you the right shapes, and hopefully the program will run!
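The same conversion can also be done with pandas' own accessors. A small sketch, assuming the column names "speed", "weight" and "result" from the question (the names X_df, y_df and y_alt are hypothetical):

```python
import numpy as np
import pandas as pd

# Placeholder data with the same shapes as the example above.
X_df = pd.DataFrame(np.ones((5, 2)), columns=["speed", "weight"])
y_df = pd.DataFrame(np.ones((5,)), columns=["result"])   # shape (5, 1)

X = X_df.to_numpy()                  # 2D array, shape (5, 2)
y = y_df["result"].to_numpy()        # selecting one column gives a 1D array, shape (5,)
y_alt = y_df.to_numpy().ravel()      # flattening the (5, 1) array also gives shape (5,)

print(X.shape, y.shape, y_alt.shape)
```

Selecting the column by name avoids the 2D intermediate entirely, while ravel() is an alternative to squeeze() when you already hold a single-column 2D array.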

Upvotes: 6

Bhushan Pant

Reputation: 1580

Use the code below:

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from sklearn.linear_model import LinearRegression

lm = LinearRegression() 
predefinedFeatureList = ["speed","weight"]
target = "result"

date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')

np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
data2 = np.random.randint(1, high=100, size=len(days))
data3 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'result': data,'speed': data2,'weight': data3})
df = df.set_index('test')
print(df)
#results = df['result']
#df.drop(['result'],axis= 1,inplace = True)
lm.fit(df[predefinedFeatureList], df[target])  # fit takes (X, y, sample_weight=None)

Upvotes: 0

Vivek Kumar

Reputation: 36609

You are passing the arguments in the wrong order. All scikit-learn estimators implementing fit() accept X, y, not y, X as you are doing.

Try this:

lm.fit(df[['speed', 'weight']], df['result'])
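As a self-contained sketch of that call, assuming randomly generated stand-in data with the question's column names (the values are made up, not the asker's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in data; the real dataframe would come from read_sql_query.
rng = np.random.RandomState(1111)
df = pd.DataFrame({
    "result": rng.randint(1, 100, size=8),
    "speed": rng.randint(1, 100, size=8),
    "weight": rng.randint(1, 100, size=8),
})

X = df[["speed", "weight"]]   # features, 2D: (n_samples, n_features)
y = df["result"]              # target, 1D: (n_samples,)

lm = LinearRegression()
lm.fit(X, y)                  # X first, then y

predictions = lm.predict(X)   # one prediction per row
print(lm.coef_, lm.intercept_)
```

The dataframe slices are passed directly; no conversion to numpy arrays is needed for this to work.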

Upvotes: 12
