Reputation: 1601
I have a pandas DataFrame whose column names are the strings '0' through '9':
import numpy as np
import pandas as pd
working_df = pd.DataFrame(np.random.rand(5,10),index=range(0,5), columns=[str(x) for x in range(10)])
working_df.loc[:,'outcome'] = [0,1,1,0,1]
I then wanted to collect each row's values into a single array stored in one column, so I did:
array_list = [Y for Y in working_df[[str(num) for num in range(10)]].values]
which gave me:
[array([ 0.0793451 , 0.3288617 , 0.75887129, 0.01128641, 0.64105905,
0.78789297, 0.69673768, 0.20354558, 0.48976411, 0.72848541]),
array([ 0.53511388, 0.08896322, 0.10302786, 0.08008444, 0.18218731,
0.2342337 , 0.52622153, 0.65607384, 0.86069294, 0.8864577 ]),
array([ 0.82878026, 0.33986175, 0.25707122, 0.96525733, 0.5897311 ,
0.3884232 , 0.10943644, 0.26944414, 0.85491211, 0.15801284]),
array([ 0.31818888, 0.0525836 , 0.49150727, 0.53682492, 0.78692193,
0.97945708, 0.53181293, 0.74330327, 0.91364064, 0.49085287]),
array([ 0.14909577, 0.33959452, 0.20607263, 0.78789116, 0.41780657,
0.0437907 , 0.67697385, 0.98579928, 0.1487507 , 0.41682309])]
I then attached it to my dataframe using:
working_df.loc[:,'array_list'] = pd.Series(array_list)
I then set up rf_clf = RandomForestClassifier()
and tried to call rf_clf.fit(working_df['array_list'][1:].values, working_df['outcome'][1:].values)
which raises ValueError: setting an array element with a sequence.
Is the problem that I'm passing an array of arrays to fit? Thanks for any insight.
Upvotes: 1
Views: 1761
Reputation: 86330
The problem is that scikit-learn expects a two-dimensional array of values as input, with shape (n_samples, n_features). You're passing a one-dimensional array of objects, where each object is itself a one-dimensional array.
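To see the difference, here is a minimal sketch that rebuilds a frame like the one in the question and compares the shape and dtype of the column's values with a stacked version. The seeded generator and the names col and stacked are only for illustration, and np.stack is one of several ways to build the 2D array.

```python
import numpy as np
import pandas as pd

# Rebuild a frame like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
feature_cols = [str(x) for x in range(10)]
working_df = pd.DataFrame(rng.random((5, 10)), columns=feature_cols)

# Store each row's features as one array per cell, as in the question.
working_df['array_list'] = pd.Series(list(working_df[feature_cols].values))

# What the question passes to fit(): a 1D array whose elements are arrays.
col = working_df['array_list'].values
print(col.shape, col.dtype)          # (5,) object

# What scikit-learn needs: a 2D numeric array.
stacked = np.stack(col)
print(stacked.shape, stacked.dtype)  # (5, 10) float64
```

The object-dtype array has no second dimension as far as NumPy is concerned, which is why converting it to a float array fails with the error in the question.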
A quick fix would be to do this:
X = np.array(list(working_df['array_list'][1:]))
y = working_df['outcome'][1:].values
rf_clf.fit(X, y)
A better fix would be to not store your two-dimensional feature array inside a one-dimensional pandas column in the first place, and to select the feature columns directly when fitting.
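As a rough sketch of that approach, assuming the same column layout as in the question, the features can be taken straight from the numeric columns. The feature_cols name, the seeded generator, and the n_estimators and random_state settings are assumptions added here for a small repeatable example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same layout as the question: ten feature columns plus an outcome column.
rng = np.random.default_rng(0)
feature_cols = [str(x) for x in range(10)]
working_df = pd.DataFrame(rng.random((5, 10)), columns=feature_cols)
working_df['outcome'] = [0, 1, 1, 0, 1]

# Select the feature columns directly; .values is already 2D.
X = working_df[feature_cols].iloc[1:].values   # rows 1..4, shape (4, 10)
y = working_df['outcome'].iloc[1:].values      # shape (4,)

rf_clf = RandomForestClassifier(n_estimators=10, random_state=0)
rf_clf.fit(X, y)

# Predict for the held-out first row; input must also be 2D.
preds = rf_clf.predict(working_df[feature_cols].iloc[:1].values)
```

This skips the intermediate array_list column entirely, so there is nothing to unpack before calling fit.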
Upvotes: 2