Katya Willard

Reputation: 2182

Get from Pandas dataframe column to features for scikit-learn model

Let's say I have a dataframe that looks like this:

import pandas as pd
import numpy as np


vectors = pd.Series([[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]], name='vector')
output = pd.Series([True, False, True], name='target')

data = pd.concat((vectors, output), axis=1)

data looks like this: a Series of lists of floats, and a Series of booleans:

            vector  target
0  [1.0, 2.0, 3.0]    True
1  [0.5, 1.5, 2.5]   False
2  [0.1, 1.1, 2.1]    True

Now, I want to fit a simple scikit-learn LogisticRegression model on top of the vectors to predict the target output.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X=data['vector'], y=data['target'])

This does not work, with the error:

ValueError: setting an array element with a sequence

I tried casting my vector data to an np array first, with

data['vector'].apply(np.array)

But this yields the same error as before.

I can get it to work by executing the following:

input_vectors = np.array(data['vector'].to_list())
clf.fit(X=input_vectors, y=data['target'])

But this seems clunky: I convert the entire pandas Series into a Python list, then convert that list into a numpy array.

I'm wondering if there is a better method here for converting this data format into one that is acceptable to scikit-learn. In reality, my datasets are much larger and this transformation is expensive. Given how compatible scikit-learn and pandas normally are, I imagine I might be missing something.

Upvotes: 2

Views: 622

Answers (2)

mozway

Reputation: 262124

You should pass a single 2D array to clf.fit, not a Series of lists (or a Series of arrays).

Use numpy.vstack:

import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X=np.vstack(data['vector']), y=data['target'])

clf.coef_
# array([[0.02622973, 0.02623115, 0.02623258]])

clf.intercept_
# array([0.57262013])
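
To see why the original call fails, here is a minimal sketch that rebuilds the question's example data. Applying np.array to each row still leaves a one-dimensional Series of dtype object, whereas np.vstack produces one 2D float array. The names as_arrays and stacked are only for illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
vectors = pd.Series(
    [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]], name="vector"
)

# Converting each row to an array does not change the container:
# the result is still a 1D Series whose elements are Python objects.
as_arrays = vectors.apply(np.array)
print(as_arrays.dtype, as_arrays.shape)

# Stacking the rows gives a single 2D numeric array, which is the
# layout that fit expects for X: (n_samples, n_features).
stacked = np.vstack(as_arrays)
print(stacked.dtype, stacked.shape)
```

The stacked array can be passed directly as X.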

Upvotes: 0

Nathan Furnal

Reputation: 2410

Since you know the number of columns, how about:

X = data["vector"].explode().values.astype(float).reshape(-1, 3)

This will explode the lists into a single Series, get the numpy values, convert them to the proper type (you could use np.float32 as well, since the values don't seem too large), and then reshape with the proper number of columns.
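
If you would rather not hard-code the 3, a small variation is to read the width from the first row. This is only a sketch and it assumes every list in the column has the same length; n_features is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same example data as in the question
data = pd.DataFrame({
    "vector": [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]],
    "target": [True, False, True],
})

# Assumption: all rows have the same number of elements as the first one
n_features = len(data["vector"].iloc[0])

# Flatten the lists into one long Series, convert to a float array,
# then fold it back into (n_samples, n_features)
X = data["vector"].explode().to_numpy(dtype=float).reshape(-1, n_features)
print(X.shape)
```

If the lists differ in length, the reshape either raises an error or silently misaligns the rows, so it is worth checking the lengths first on real data.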

Upvotes: 0
