Reputation: 113
I want one of my ONLY ONE of my features to be converted to a separate binary features:
df["pattern_id"]
Out[202]:
0 3
1 3
...
7440 2
7441 2
7442 3
Name: pattern_id, Length: 7443, dtype: int64
df["pattern_id"]
Out[202]:
0 0 0 1
1 0 0 1
...
7440 0 1 0
7441 0 1 0
7442 0 0 1
Name: pattern_id, Length: 7443, dtype: int64
I want to use OneHotEncoder, data is int, so no need to encode it:
onehotencoder = OneHotEncoder(categorical_features=["pattern_id"])
df = onehotencoder.fit_transform(df).toarray()
ValueError: could not convert string to float: 'http://www.zaragoza.es/sedeelectronica/'
Interesting enough I receive an error... sklearn tried to encode another column, not the one I wanted.
We have to encode pattern_id to be an integer value
I used this link: Issue with OneHotEncoder for categorical features
#transform the pattern_id feature to int
encoding_feature = ["pattern_id"]
enc = LabelEncoder()
enc.fit(encoding_feature)
working_feature = enc.transform(encoding_feature)
working_feature = working_feature.reshape(-1, 1)
ohe = OneHotEncoder(sparse=False)
#convert the pattern_id feature to separate binary features
onehotencoder = OneHotEncoder(categorical_features=working_feature, sparse=False)
df = onehotencoder.fit_transform(df).toarray()
And I get the same error. What am I doing wrong ?
source: https://github.com/martin-varbanov96/scraper/blob/master/logo_scrape/logo_scrape/analysis.py
df
Out[259]:
found_img is_http link_img \
0 True 0 img/aahoteles.svg
//www.zaragoza.es/cont/paginas/img/sede/logo_e...
pattern_id current_link site_id \
0 3 https://www.aa-hoteles.com/es/reservas 3
6 3 https://www.aa-hoteles.com/es/ofertas-hoteles 3
7 2 http://about.pressreader.com/contact-us/ 4
8 3 http://about.pressreader.com/contact-us/ 4
status link_id
0 200 https://www.aa-hoteles.com/
1 200 https://www.365travel.asia/
2 200 https://www.365travel.asia/
3 200 https://www.365travel.asia/
4 200 https://www.aa-hoteles.com/
5 200 https://www.aa-hoteles.com/
6 200 https://www.aa-hoteles.com/
7 200 http://about.pressreader.com
8 200 http://about.pressreader.com
9 200 https://www.365travel.asia/
10 200 https://www.365travel.asia/
11 200 https://www.365travel.asia/
12 200 https://www.365travel.asia/
13 200 https://www.365travel.asia/
14 200 https://www.365travel.asia/
15 200 https://www.365travel.asia/
16 200 https://www.365travel.asia/
17 200 https://www.365travel.asia/
18 200 http://about.pressreade
[7443 rows x 8 columns]
Upvotes: 4
Views: 4417
Reputation: 31
The recommended way to work with different column types is detailed in the sklearn documentation here.
Representative example:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)])
Upvotes: 0
Reputation: 21
You can also do it like this
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(df.required_column.values.reshape(-1, 1)).toarray()
We need to reshape the column, because fit_transform
requires a 2-D array. Then you can add columns to this numpy array and then merge it with your DataFrame.
Seen from this link here
Upvotes: 2
Reputation: 5355
If you take a look at the documentation for OneHotEncoder
you can see that the categorical_features
argument expects '“all” or array of indices or mask' not a string. You can make your code work by changing to the following lines
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Create a dataframe of random ints
df = pd.DataFrame(np.random.randint(0, 4, size=(100, 4)),
columns=['pattern_id', 'B', 'C', 'D'])
onehotencoder = OneHotEncoder(categorical_features=[df.columns.tolist().index('pattern_id')])
df = onehotencoder.fit_transform(df)
However df
will no longer be a DataFrame
, I would suggest working directly with the numpy arrays.
Upvotes: 2