Reputation: 2182
Let's say I have a DataFrame built like this:
import pandas as pd
import numpy as np
vectors = pd.Series([[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]], name='vector')
output = pd.Series([True, False, True], name='target')
data = pd.concat((vectors, output), axis=1)
data
The resulting DataFrame looks like this: a Series of lists of floats and a Series of booleans:
            vector  target
0  [1.0, 2.0, 3.0]    True
1  [0.5, 1.5, 2.5]   False
2  [0.1, 1.1, 2.1]    True
Now, I want to fit a simple scikit-learn LogisticRegression model on top of the vectors to predict the target output.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X=data['vector'], y=data['target'])
This does not work, with the error:
ValueError: setting an array element with a sequence
I tried casting my vector data to an np array first, with
data['vector'].apply(np.array)
But this yields the same error as before.
I can get it to work by executing the following:
input_vectors = np.array(data['vector'].to_list())
clf.fit(X=input_vectors, y=data['target'])
But this seems quite clunky: I turn the entire pandas Series into a list, then turn that list into a NumPy array.
I'm wondering if there is a better method here for converting this data format into one that is acceptable to scikit-learn. In reality, my datasets are much larger and this transformation is expensive. Given how compatible scikit-learn and pandas normally are, I imagine I might be missing something.
Upvotes: 2
Views: 622
Reputation: 262124
You should pass a 2D array to clf.fit, not a list / Series of arrays.
Use numpy.vstack:
import numpy as np
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X=np.vstack(data['vector']), y=data['target'])
clf.coef_
# array([[0.02622973, 0.02623115, 0.02623258]])
clf.intercept_
# array([0.57262013])
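To see why the Series itself is rejected while the stacked array works, it can help to compare dtypes and shapes. This is a minimal sketch that rebuilds the question's data; the names as_object and as_matrix are only illustrative:

```python
import numpy as np
import pandas as pd

vectors = pd.Series([[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]], name='vector')

# A Series of lists converts to a 1-D object array: each element is still a Python list
as_object = vectors.to_numpy()
print(as_object.dtype, as_object.shape)  # object (3,)

# np.vstack stacks the rows into a single 2-D float array
as_matrix = np.vstack(vectors)
print(as_matrix.dtype, as_matrix.shape)  # float64 (3, 3)
```

The second form has one row per sample and one column per feature, which is the layout clf.fit expects for X.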
Upvotes: 0
Reputation: 2410
Since you know the number of columns, how about:
X = data["vector"].explode().values.astype(float).reshape(-1, 3)
This will explode the lists into a single Series, get the NumPy values, convert them to the proper type (you could use np.float32 as well since the values don't seem too large), and then reshape with the proper number of columns.
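If the number of columns should not be hardcoded, one possible variant reads it from the first row. This is only a sketch, assuming every list has the same length; n_cols is an illustrative name, not something from the question:

```python
import numpy as np
import pandas as pd

vectors = pd.Series([[1.0, 2.0, 3.0], [0.5, 1.5, 2.5], [0.1, 1.1, 2.1]], name='vector')

# Assumption: all rows have the same length, so the first row gives the width
n_cols = len(vectors.iloc[0])

# Flatten the lists, convert from object dtype to float, then restore the row structure
X = vectors.explode().to_numpy().astype(float).reshape(-1, n_cols)

# Sanity check: the result should match the np.vstack approach
assert np.array_equal(X, np.vstack(vectors))
```

If the rows turn out to have different lengths, the reshape either raises an error or silently misaligns the values, so it may be worth checking the lengths first on real data.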
Upvotes: 0