Jvr

Reputation: 563

Array inside list

I'm confused by this problem. I'm trying to use sklearn's MinMaxScaler, but I get an error which suggests that I'm setting an array element with a sequence.

The code is:

    raw_values = series.values
    # transform data to be stationary
    diff_series = difference(raw_values, 1)
    diff_values = diff_series.values
    diff_values = diff_values.reshape(len(diff_values), 1)

    # rescale values to 0,1
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled_values = scaler.fit_transform(diff_values); print(scaled_values)
    scaled_values = scaled_values.reshape(len(scaled_values), 1)

"series" is a time series, previously differenced in pandas, that I'm trying to rescale to the range [0, 1] with MinMaxScaler.

I get the following error when running the code: ValueError: setting an array element with a sequence.

What I don't understand is that the code runs fine when there is just one feature (a single column), but in this case I have 2 features, each in its own column.

Traceback:

File "C:/....py", line 88, in prepare_data
    scaled_values = scaler.fit_transform(diff_values); print(scaled_values)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\preprocessing\data.py", line 292, in fit
    return self.partial_fit(X, y)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\preprocessing\data.py", line 318, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.

And this is what I get when I print diff_values:

[[array([  -1.3,  119. ])]
 [array([ 0.5, -9. ])]
 [array([  0.8,  17. ])]
 ..., 
 [array([   2.8,  742. ])]
 [array([  1.50000000e+00,  -1.65900000e+03])]
 [array([  -2.,  856.])]]

The full code is not mine; I took it from here

EDIT:

Here is my dataset

Just change the file name 'shampoo-sales.csv' to 'datos2.csv', and change this line:

return datetime.strptime('190'+x, '%Y-%m') 

to this one:

return datetime.strptime(''+x, '%Y-%m-%d')

Upvotes: 1

Views: 1482

Answers (1)

andrew_reece

Reputation: 21274

In the tutorial you linked to, the object series is actually a Pandas Series: a one-dimensional vector of values with a named index. Your dataset, however, contains two columns of data in addition to the time series index, which makes it a DataFrame. This is why the tutorial code breaks with your data.
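One way to see why the shape matters is to rebuild the printed diff_values by hand. The sketch below is an illustration, not the tutorial's code: it assumes that each row of the differenced data ended up as a length-2 array stored in a single object cell, which is what the printed output suggests, and it uses the first three rows from that output.

```python
import numpy as np

# Hand-built stand-in for the printed diff_values: an (n, 1) object array
# whose single column holds one length-2 array per row (assumed structure).
rows = [np.array([-1.3, 119.0]), np.array([0.5, -9.0]), np.array([0.8, 17.0])]
diff_values = np.empty((len(rows), 1), dtype=object)
for i, row in enumerate(rows):
    diff_values[i, 0] = row

# Same kind of conversion as the last frame of the traceback:
# np.array(array, dtype=...) cannot fit a length-2 array into one float cell.
try:
    np.array(diff_values, dtype=float)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("ValueError:", exc)

# Stacking the per-row arrays gives a plain 2-D float array,
# with one column per feature.
fixed = np.vstack(list(diff_values[:, 0]))
print(fixed.shape)
```

With one feature, each cell holds a single number, so the conversion to float succeeds; with two features, each cell holds a sequence, and the conversion fails.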

Here's a sample from your data:

from datetime import datetime

import pandas as pd

def parser(x):
    return datetime.strptime(''+x, '%Y-%m-%d')

df = pd.read_csv("datos2.csv", header=None, parse_dates=[0], 
                 index_col=0, squeeze=True, date_parser=parser)
df.head()
               1     2
0                     
2012-01-01  10.9  3736
2012-01-02  10.3  3570
2012-01-03   9.0  3689
2012-01-04   9.5  3680
2012-01-05  10.3  3697

And the equivalent section from the tutorial:
"Running the example loads the dataset as a Pandas Series and prints the first 5 rows."

Month
1901-01-01    266.0
1901-02-01    145.9
1901-03-01    183.1
1901-04-01    119.3
1901-05-01    180.3
Name: Sales, dtype: float64

To verify this, select one of your columns, store it as series, and then run the MinMaxScaler code again. You'll see that it runs without error:

series = df[1]
# ... compute difference and do scaling ...
print(scaled_values)
[[ 0.58653846]
 [ 0.55288462]
 [ 0.63942308]
 ..., 
 [ 0.75      ]
 [ 0.6875    ]
 [ 0.51923077]]
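As a minimal, self-contained sketch of that single-column path: the snippet below uses only the first five values of column 1 from the sample above, and it uses pandas' Series.diff as a stand-in for the tutorial's difference() helper, so the numbers will not match the full-dataset output shown here.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First five rows of column 1 from the sample (illustrative subset only).
series = pd.Series(
    [10.9, 10.3, 9.0, 9.5, 10.3],
    index=pd.date_range("2012-01-01", periods=5),
)

# Stand-in for difference(series, 1): lag-1 difference, dropping the leading NaN.
diff_values = series.diff(1).dropna().to_numpy()

# MinMaxScaler expects a 2-D input, so reshape to (n_samples, 1).
diff_values = diff_values.reshape(-1, 1)

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_values)
print(scaled_values.shape)
```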

Note: One other minor difference between your dataset and the tutorial data is that your data has no header row. Set header=None in read_csv so that your first row of data isn't used as column headers.

UPDATE
To pass your entire dataset to MinMaxScaler, just run difference() on both columns and pass in the transformed vectors for scaling. MinMaxScaler accepts a two-dimensional input such as a DataFrame, with one column per feature, and scales each column independently:

ncol = 2
diff_df = pd.concat([difference(df[i], 1) for i in range(1,ncol+1)], axis=1)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_df)
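For a version that can be run without the tutorial's difference() helper, here is a rough sketch that uses DataFrame.diff instead (a substitution, not the tutorial's method) on the five sample rows shown earlier. The inverse_transform call at the end is an extra check that the scaling can be undone.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The five sample rows from df.head(), with the same integer column labels.
df = pd.DataFrame(
    {1: [10.9, 10.3, 9.0, 9.5, 10.3], 2: [3736, 3570, 3689, 3680, 3697]},
    index=pd.date_range("2012-01-01", periods=5),
)

# Lag-1 difference of every column at once; the first row is NaN and is dropped.
diff_df = df.diff(1).dropna()

# Each column is scaled to [0, 1] independently of the other.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_df)

# Undo the scaling to recover the differenced values.
restored = scaler.inverse_transform(scaled_values)
print(scaled_values.shape)
```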

Upvotes: 1
