Scikit/Numpy/Pandas ValueError: setting an array element with a sequence

Question

I had a pandas dataframe that had columns with strings from 0-9 as column names:

import numpy as np
import pandas as pd

working_df = pd.DataFrame(np.random.rand(5,10),index=range(0,5), columns=[str(x) for x in range(10)])
working_df.loc[:,'outcome'] = [0,1,1,0,1]

I then wanted to get an array of all of these numbers into one column so I did:

array_list = [Y for Y in working_df[[str(num) for num in range(10)]].values]

which gave me:

[array([ 0.0793451 ,  0.3288617 ,  0.75887129,  0.01128641,  0.64105905,
         0.78789297,  0.69673768,  0.20354558,  0.48976411,  0.72848541]),
 array([ 0.53511388,  0.08896322,  0.10302786,  0.08008444,  0.18218731,
         0.2342337 ,  0.52622153,  0.65607384,  0.86069294,  0.8864577 ]),
 array([ 0.82878026,  0.33986175,  0.25707122,  0.96525733,  0.5897311 ,
         0.3884232 ,  0.10943644,  0.26944414,  0.85491211,  0.15801284]),
 array([ 0.31818888,  0.0525836 ,  0.49150727,  0.53682492,  0.78692193,
         0.97945708,  0.53181293,  0.74330327,  0.91364064,  0.49085287]),
 array([ 0.14909577,  0.33959452,  0.20607263,  0.78789116,  0.41780657,
         0.0437907 ,  0.67697385,  0.98579928,  0.1487507 ,  0.41682309])]

I then attached it to my dataframe using:

working_df.loc[:,'array_list'] = pd.Series(array_list)

I then set up rf_clf = RandomForestClassifier() and tried to call rf_clf.fit(working_df['array_list'][1:].values, working_df['outcome'][1:].values), which raises ValueError: setting an array element with a sequence

Is it a problem with the array of arrays in the fitting? Thanks for any insight.

jakevdp · Accepted Answer

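The problem is that scikit-learn expects a two-dimensional array of values as input. You're passing a one-dimensional array of objects (with each object itself being a one-dimensional array). NumPy cannot convert that object array into a 2-D float array, which is what triggers the ValueError.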
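
To make the shape mismatch concrete, here is a minimal sketch that only uses NumPy and pandas. The names rows, col, as_objects and as_matrix are illustrative and not from the question; the data is random, as in the original post.

```python
import numpy as np
import pandas as pd

# Five rows of ten random features, stored as a list of 1-D arrays
rng = np.random.default_rng(0)
rows = [rng.random(10) for _ in range(5)]

# Putting the arrays in a pandas column gives an object-dtype Series
col = pd.Series(rows)

# .values is a 1-D array whose elements are themselves arrays
as_objects = col.values
print(as_objects.dtype, as_objects.shape)

# Stacking the rows yields the 2-D float array that estimators expect
as_matrix = np.stack(col.tolist())
print(as_matrix.dtype, as_matrix.shape)
```

The first array has one dimension with five Python objects in it; the second has shape (n_samples, n_features), which is the layout fit() works with.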

A quick fix would be to do this:

X = np.array(list(working_df['array_list'][1:]))
y = working_df['outcome'][1:].values
rf_clf.fit(X, y)

A better fix would be to not store your two-dimensional feature array within a one-dimensional pandas column.
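
One way this could look, as a sketch rather than the only option: keep the features in their own columns and select them directly when fitting. The name feature_cols and the n_estimators and random_state settings are assumptions added for illustration; the [1:] slice mirrors the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same layout as in the question: ten feature columns named '0'..'9'
feature_cols = [str(x) for x in range(10)]
working_df = pd.DataFrame(np.random.rand(5, 10), columns=feature_cols)
working_df['outcome'] = [0, 1, 1, 0, 1]

# Select the feature columns directly; .values is already 2-D
X = working_df[feature_cols].values[1:]
y = working_df['outcome'].values[1:]

# n_estimators and random_state are illustrative choices
rf_clf = RandomForestClassifier(n_estimators=10, random_state=0)
rf_clf.fit(X, y)
predictions = rf_clf.predict(X)
```

Here X has shape (4, 10), so no conversion from an object column is needed before calling fit().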
