Reputation: 563
I'm stuck on this problem. I'm trying to use sklearn's MinMaxScaler,
but I get an error saying that I'm setting an array element with a sequence.
The code is:
raw_values = series.values
# transform data to be stationary
diff_series = difference(raw_values, 1)
diff_values = diff_series.values
diff_values = diff_values.reshape(len(diff_values), 1)
# rescale values to 0,1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_values); print(scaled_values)
scaled_values = scaled_values.reshape(len(scaled_values), 1)
"series" is a time series that was previously differenced in pandas,
and I'm trying to rescale it to the range [0, 1] with MinMaxScaler.
I get the following error when running the code:
ValueError: setting an array element with a sequence.
What I don't understand is that when there is just one feature
(one variable in a single column), the code runs fine, but in this case I have 2 features,
each one in a different column.
Traceback:
  File "C:/....py", line 88, in prepare_data
    scaled_values = scaler.fit_transform(diff_values); print(scaled_values)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\preprocessing\data.py", line 292, in fit
    return self.partial_fit(X, y)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\preprocessing\data.py", line 318, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)
  File "C:\Users\name\AppData\Roaming\Python\Python35\site-packages\sklearn\utils\validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
And this is what I get when I print diff_values:
[[array([ -1.3, 119. ])]
[array([ 0.5, -9. ])]
[array([ 0.8, 17. ])]
...,
[array([ 2.8, 742. ])]
[array([ 1.50000000e+00, -1.65900000e+03])]
[array([ -2., 856.])]]
The full code is not mine; I took it from here
EDIT:
Here is my dataset
Just change the file name 'shampoo-sales.csv' to 'datos2.csv', and change this line:
return datetime.strptime('190'+x, '%Y-%m')
to this one:
return datetime.strptime(''+x, '%Y-%m-%d')
Upvotes: 1
Views: 1482
Reputation: 21274
In the tutorial you linked to, the object series is actually a Pandas Series. It's a vector of information with a named index. Your dataset, however, contains two fields of information in addition to the time series index, which makes it a DataFrame. This is the reason why the tutorial code breaks with your data.
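As a rough illustration of the failure itself, here is a minimal sketch that uses only NumPy. It assumes the differencing step left you with an object array whose single column holds whole rows, which is what your printed diff_values looks like; the variable names and the three sample rows are only for illustration. The conversion at the end is the same kind of call that appears on the last line of your traceback.

```python
import numpy as np

# Three of the row differences printed in the question.
rows = [np.array([-1.3, 119.0]), np.array([0.5, -9.0]), np.array([0.8, 17.0])]

# Build a (3, 1) object array in which every cell holds a whole array.
nested = np.empty((len(rows), 1), dtype=object)
for i, row in enumerate(rows):
    nested[i, 0] = row

# Converting that structure to floats fails, because each cell is a sequence
# rather than a single number.
try:
    np.array(nested, dtype=float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Stacking the rows instead gives a plain 2-D float array, one column per field.
flat = np.vstack(rows)
print(flat.shape)
```

The second form, one row per time step and one column per field, is the layout the scaler can work with.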
Here's a sample from your data:
import pandas as pd
from datetime import datetime

def parser(x):
    return datetime.strptime(x, '%Y-%m-%d')

df = pd.read_csv("datos2.csv", header=None, parse_dates=[0],
                 index_col=0, squeeze=True, date_parser=parser)
df.head()
               1     2
0
2012-01-01  10.9  3736
2012-01-02  10.3  3570
2012-01-03   9.0  3689
2012-01-04   9.5  3680
2012-01-05  10.3  3697
And the equivalent section from the tutorial:
"Running the example loads the dataset as a Pandas Series and prints the first 5 rows."
Month
1901-01-01 266.0
1901-02-01 145.9
1901-03-01 183.1
1901-04-01 119.3
1901-05-01 180.3
Name: Sales, dtype: float64
To verify this, select one of your fields, store it as series, and then try running the MinMaxScaler. You'll see that it runs without error:
series = df[1]
# ... compute difference and do scaling ...
print(scaled_values)
[[ 0.58653846]
[ 0.55288462]
[ 0.63942308]
...,
[ 0.75 ]
[ 0.6875 ]
[ 0.51923077]]
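For reference, the elided step might look roughly like the sketch below. The difference() function here is a hypothetical stand-in that I'm assuming behaves like the tutorial's helper (lagged differences returned as a Series), and the five input values are just the first rows of column 1 from the sample above, so the numbers will not match the full-dataset output.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def difference(dataset, interval=1):
    """Hypothetical stand-in for the tutorial helper: lagged differences as a Series."""
    diff = [dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))]
    return pd.Series(diff)


# First five values of column 1 from the sample shown earlier.
series = pd.Series([10.9, 10.3, 9.0, 9.5, 10.3])

raw_values = series.values                     # 1-D float array
diff_series = difference(raw_values, 1)        # four lagged differences
diff_values = diff_series.values.reshape(-1, 1)  # one column, one row per step

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_values)
print(scaled_values)
```

Because series is one-dimensional here, every difference is a single number, and the reshape produces an ordinary column of floats.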
Note: One other minor difference between your dataset and the tutorial data is that your data has no header row. Set header=None so that your first row of data isn't used as column headers.
UPDATE
To pass your entire dataset to MinMaxScaler, just run difference() on both columns and pass in the transformed vectors for scaling. MinMaxScaler accepts a two-dimensional DataFrame with one column per feature:
ncol = 2
diff_df = pd.concat([difference(df[i], 1) for i in range(1,ncol+1)], axis=1)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_df)
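If you'd like to try this without the CSV file, here is a small self-contained sketch built from the five sample rows shown earlier. Note that it swaps the tutorial's difference() for pandas' built-in DataFrame.diff(), which is my substitution rather than something the tutorial uses; it differences every column at once and keeps the date index.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The five sample rows from above, rebuilt by hand (columns named 1 and 2).
df = pd.DataFrame(
    {1: [10.9, 10.3, 9.0, 9.5, 10.3], 2: [3736, 3570, 3689, 3680, 3697]},
    index=pd.date_range("2012-01-01", periods=5, freq="D"),
)

# Lag-1 difference of both columns; the first row has no predecessor, so drop it.
diff_df = df.diff().dropna()

# Each column is rescaled independently to the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_values = scaler.fit_transform(diff_df)

print(scaled_values.shape)
print(scaled_values.min(axis=0), scaled_values.max(axis=0))
```

Each column gets its own minimum and maximum, so the large values in column 2 don't squash the small ones in column 1.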
Upvotes: 1