Reputation: 543
I have a pandas dataframe:
df3:
Text | Topic | Label
some text | 2 | 0
other text | 1 | 0
text 3 | 3 | 1
I split it into training and test sets:
x_train, x_test, y_train, y_test = train_test_split(df3[['Text', 'Topic']],df3['Label'], test_size=0.3, random_state=434)
I want to use both the Text and Topic features to predict Label.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
# pipeline for text data
text_features = df3['Text']
text_transformer = Pipeline(steps=[
('vectorizer', TfidfVectorizer(stop_words="english"))
])
# pipeline for categorical data
categorical_features = df3['Topic']
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
Then I try to combine the input variables:
# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('Text', text_transformer, text_features),
('Topic', categorical_transformer, categorical_features)
])
# add model to be part of pipeline
clf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
("model", SVC())
])
Finally I use fit:
x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
clf_s= SVC().fit(x_train, y_train)
clf_s.score(x_test, y_test)
Output says:
"ValueError: A given column is not a column of the dataframe"
The error refers to the line:
x_train = preprocessor.fit_transform(x_train)
Where did I go wrong?
Upvotes: 1
Views: 223
Reputation: 4098
The transformers tuples are not built correctly. According to the documentation of ColumnTransformer, the last element of each tuple must say which columns to select (a column name or a list of names), not contain the column data itself. The corrected tuples are ('Text', text_transformer, "Text") and ('Topic', categorical_transformer, ["Topic"]). See the examples in the ColumnTransformer documentation.
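To illustrate the two column specifications, here is a minimal sketch on a made-up three-row frame (the data and the names `toy`, `ct` and `features` are assumptions for illustration, not part of the question). A plain string hands the transformer a 1-D column, which is what TfidfVectorizer expects; a list of names hands it a 2-D selection, which is what OneHotEncoder expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, shaped like df3 in the question
toy = pd.DataFrame({
    "Text": ["some text", "other text", "text 3"],
    "Topic": [2, 1, 3],
})

ct = ColumnTransformer(transformers=[
    # plain string -> 1-D column of documents for the vectorizer
    ("Text", TfidfVectorizer(), "Text"),
    # list of names -> 2-D selection for the encoder
    ("Topic", OneHotEncoder(handle_unknown="ignore"), ["Topic"]),
])

# One row per sample; TF-IDF columns followed by one-hot columns
features = ct.fit_transform(toy)
print(features.shape)
```

Passing df3['Text'] instead makes ColumnTransformer treat the values of the Series as column names, which is why it complains that a given column is not a column of the dataframe.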
Alternatively, you can use ColumnSelector from mlxtend. Please refer to this post.
After you fix this, y_train and y_test each need samples of every label, because SVC cannot be fitted on a single class. So, add some more data.
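One more thing to be aware of: you fit the transformer on the training features and then use it to transform the test features. That is the usual way to avoid leaking test information into the model, but anything that occurs only in the test set is unseen at fit time. With handle_unknown='ignore', the one-hot encoder encodes an unseen Topic as all zeros, and TfidfVectorizer ignores unseen words. If every category must get its own column, pass the full list of categories to OneHotEncoder through its categories parameter rather than fitting on the test data.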
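How the encoder treats a category it did not see during fit can be sketched with a standalone OneHotEncoder. The values below are made up for illustration, and `encoder` and `encoded` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([[1], [2], [3]]))  # categories seen at fit time: 1, 2, 3

# 2 was seen during fit, 4 was not
encoded = encoder.transform(np.array([[2], [4]])).toarray()
print(encoded)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

The unseen value does not raise an error; it simply gets no active column.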
Here is the working code. The changed lines are marked with the comment # Changed this line:
import pandas as pd
df3 = pd.DataFrame(data=[["some text", 2, 0],["other text", 1, 0],["text 3", 3, 1],["text 3", 3, 0],["text 3", 3, 1],["text 3", 3, 0]], columns=["Text", "Topic", "Label"]) # Changed this line
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df3[['Text', 'Topic']],df3['Label'], test_size=0.3, random_state=434)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
# pipeline for text data
# (the "Text" column is selected by name in the ColumnTransformer below)
text_transformer = Pipeline(steps=[
('vectorizer', TfidfVectorizer(stop_words="english"))
])
# pipeline for categorical data
# (the "Topic" column is selected by name in the ColumnTransformer below)
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value=0)), # Changed this line: Topic is numeric, so the fill value must be numeric too
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# combine preprocessing with ColumnTransformer
preprocessor = ColumnTransformer(
transformers=[
('Text', text_transformer, "Text"), # Changed this line
('Topic', categorical_transformer, ["Topic"]) # Changed this line
])
# add model to be part of pipeline
clf_pipe = Pipeline(steps=[('preprocessor', preprocessor),
("model", SVC())
])
# _ = preprocessor.fit_transform(df3)
x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)
clf_s = SVC().fit(x_train, y_train)
clf_s.score(x_test, y_test)
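Note that clf_pipe is defined above but never fitted; the preprocessor and a separate SVC are applied by hand instead. As a sketch of how the pipeline could be used end to end, the following fits it directly. The toy frame `data`, the `stratify` argument, `test_size=0.25` and `random_state=0` are assumptions for illustration, not part of the original code; `stratify` keeps both labels in each split.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Hypothetical data: four rows per label so a stratified split is possible
data = pd.DataFrame({
    "Text": ["some text", "other text", "more words", "text again",
             "text 3", "third text", "another one", "last text"],
    "Topic": [2, 1, 2, 1, 3, 3, 1, 3],
    "Label": [0, 0, 0, 0, 1, 1, 1, 1],
})

# stratify keeps the label proportions the same in train and test
x_train, x_test, y_train, y_test = train_test_split(
    data[["Text", "Topic"]], data["Label"],
    test_size=0.25, random_state=0, stratify=data["Label"])

preprocessor = ColumnTransformer(transformers=[
    ("Text", TfidfVectorizer(), "Text"),
    ("Topic", OneHotEncoder(handle_unknown="ignore"), ["Topic"]),
])

clf_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                           ("model", SVC())])

# The pipeline fits the preprocessor on the training data only,
# then reuses the fitted preprocessor when scoring the test data
clf_pipe.fit(x_train, y_train)
accuracy = clf_pipe.score(x_test, y_test)
print(accuracy)
```

Fitting the whole pipeline keeps the raw x_train and x_test intact and makes it harder to apply the wrong preprocessing to new data.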
Upvotes: 1