Reputation: 1
I'm building a text authorship attribution model. The classifier is an SVM (linear kernel), and I want to evaluate it with cross_val_score from sklearn.model_selection.
The question is how to feed different features to the classifier through a pipeline, mainly custom ones that don't come from a library's transformers (e.g. average sentence length, frequency of punctuation marks, vocabulary richness), so that the classifier is trained on all of them together.
This code, which uses the standard tf-idf transformer, works fine:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
# example of data
data = [['Anton', "The revival of the 2015 festival ushered in live music from iconic Filipino talents such as Barbie Almalbis, Kevin Toy’s, and Hilera, which had the beaches of San Juan flowing with good vibes."],
['Anton', "Tip: For a hassle-free experience, make sure to pre-book online with Biyaheroes.com, which makes public transport so much easier, even for first-time commuters. With their real-time seat and schedule selectors, commuters get a very useful overview of their trip schedules so they can plan ahead."],
['Anton', "Hungry surfers and sun worshipers can easily walk along the beach and on the parallel road, where lanes of restaurants offer a wide array of cuisines. There are also a number of cafes and food stalls to choose from."],
['Brendan', 'Today, I’m back here again, and again reminded of what makes Alberta such a brilliant place to travel: its diversity. I left the edge of the snow-covered Rocky Mountains in the morning, and by midday I’m here in the dry heat of the desert and prairies looking down on a valley of still water and stone figures.'],
['Brendan', "A life in which I spent my nights sipping exotic drinks, nibbling on strange foods, and diving head first into the local night life. All at the same time I feel scared. But unlike most people this is the part I love. I love being scared, because travel has taught when you’re scared you’re probably about to embark on something incredible."],
['Brendan', "Of the 44 kilometers of trail, about 25 of those take hikers above the treeline. And well the trail isn’t exactly super challenging, most of it is fairly flat aside from a couple sections, it does take you to parts of the mountains that usually require extreme hikes to get to."],
['Dave', 'If anyone has a fun personality and wants to start living abroad, I’d definitely recommend applying to be tour guide around Europe!'],
['Dave', 'I found myself a decent job, a great shared house to live in, and had an amazing crew to hang out with every weekend. I was no longer a nomad. Sydney became more than just another travel destination, it became my second home.'],
['Dave', "I immediately fell in love with the long-term backpacking culture, the budget travel options in South-East Asia, and treating the world as my classroom. Traveling during your twenties is so important, and I’m so happy I figured out this was an option!"],
['Derek', "The other day I received an email from a reader asking me to confirm the proper way to bargain in foreign countries. The ‘proper way’ that was mentioned is something that I’ve heard from travelers all the time. It’s the 50% rule. And to me, the rule is wrong."],
['Derek', "If you see something you want to purchase, visit 2-3 other shops nearby that sell the same thing or something similar. Ask how much it costs at each of the shops. This will give you a general idea of a true starting price for negotiations. If one shop quotes you $50, another quotes $35 and another one quotes you $20, you know the actual price is below $20."],
['Derek', "As travel becomes more and more popular and commonplace though, such tourist crowds seem to be the norm all over the world. Walking down the street in many destinations requires a lot of focus in order to avoid bumping into strollers, lost tourists and group leaders that don’t seem to mind taking over the sidewalks."]]
df = pd.DataFrame(data, columns = ['author', 'text'])
# define data set
X = df['text']
# define labels set; transform non-numerical labels to numerical labels
labelEncoder = preprocessing.LabelEncoder()
y = labelEncoder \
.fit(df['author'].unique()) \
.transform(df['author'].values)
# create pipeline
pipeline = Pipeline([
    ('tf_idf', TfidfVectorizer()),
    ('classifier', svm.SVC(kernel='linear'))
])
# cross-validation
scores_pipe = cross_val_score(pipeline, X, y, scoring='accuracy', cv=2)
mean_pipe_score = scores_pipe.mean()
print("Accuracy for tf-idf:", mean_pipe_score)
The problems come when I try to create a custom transformer class (using examples from here). I get warnings and Accuracy = nan.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# example of data
data = [['Anton', "The revival of the 2015 festival ushered in live music from iconic Filipino talents such as Barbie Almalbis, Kevin Toy’s, and Hilera, which had the beaches of San Juan flowing with good vibes."],
['Anton', "Tip: For a hassle-free experience, make sure to pre-book online with Biyaheroes.com, which makes public transport so much easier, even for first-time commuters. With their real-time seat and schedule selectors, commuters get a very useful overview of their trip schedules so they can plan ahead."],
['Anton', "Hungry surfers and sun worshipers can easily walk along the beach and on the parallel road, where lanes of restaurants offer a wide array of cuisines. There are also a number of cafes and food stalls to choose from."],
['Brendan', 'Today, I’m back here again, and again reminded of what makes Alberta such a brilliant place to travel: its diversity. I left the edge of the snow-covered Rocky Mountains in the morning, and by midday I’m here in the dry heat of the desert and prairies looking down on a valley of still water and stone figures.'],
['Brendan', "A life in which I spent my nights sipping exotic drinks, nibbling on strange foods, and diving head first into the local night life. All at the same time I feel scared. But unlike most people this is the part I love. I love being scared, because travel has taught when you’re scared you’re probably about to embark on something incredible."],
['Brendan', "Of the 44 kilometers of trail, about 25 of those take hikers above the treeline. And well the trail isn’t exactly super challenging, most of it is fairly flat aside from a couple sections, it does take you to parts of the mountains that usually require extreme hikes to get to."],
['Dave', 'If anyone has a fun personality and wants to start living abroad, I’d definitely recommend applying to be tour guide around Europe!'],
['Dave', 'I found myself a decent job, a great shared house to live in, and had an amazing crew to hang out with every weekend. I was no longer a nomad. Sydney became more than just another travel destination, it became my second home.'],
['Dave', "I immediately fell in love with the long-term backpacking culture, the budget travel options in South-East Asia, and treating the world as my classroom. Traveling during your twenties is so important, and I’m so happy I figured out this was an option!"],
['Derek', "The other day I received an email from a reader asking me to confirm the proper way to bargain in foreign countries. The ‘proper way’ that was mentioned is something that I’ve heard from travelers all the time. It’s the 50% rule. And to me, the rule is wrong."],
['Derek', "If you see something you want to purchase, visit 2-3 other shops nearby that sell the same thing or something similar. Ask how much it costs at each of the shops. This will give you a general idea of a true starting price for negotiations. If one shop quotes you $50, another quotes $35 and another one quotes you $20, you know the actual price is below $20."],
['Derek', "As travel becomes more and more popular and commonplace though, such tourist crowds seem to be the norm all over the world. Walking down the street in many destinations requires a lot of focus in order to avoid bumping into strollers, lost tourists and group leaders that don’t seem to mind taking over the sidewalks."]]
df = pd.DataFrame(data, columns = ['author', 'text'])
# define data set
X = df['text']
# define labels set; transform non-numerical labels to numerical labels
labelEncoder = preprocessing.LabelEncoder()
y = labelEncoder \
.fit(df['author'].unique()) \
.transform(df['author'].values)
# extracts given columns from df
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__( self, feature_names ):
        self._feature_names = feature_names
    def fit( self, X, y = None ):
        return self
    def transform( self, X, y = None ):
        return X[self._feature_names]
# counts the frequency of ! among all the chars
def count_exclamationMark(text):
    counter = 0
    for char in text:
        if char == "!":
            counter +=1
    return counter / len(text)
# Transforming column of text data into frequencies of !
class ExclamationTransformer(BaseEstimator,TransformerMixin):
    #Class Constructor
    def __init__(self, exclamation = True):
        self._exclamation = exclamation
    #Return self, nothing else to do here
    def fit( self, X, y = None ):
        return self
    #Custom transform method we wrote that creates aforementioned features and drops redundant ones
    def transform(self, X, y = None):
        #Check if needed
        if self._exclamation:
            #create new column
            X['exclamations'] = X['text'].apply(count_exclamationMark)
        X = X.drop('text', axis = 1)
        #Converting any infinity values in the dataset to Nan
        X = X.replace([ np.inf, -np.inf ], np.nan)
        #returns a numpy array
        return X.values
# When I apply these classes manually in a train-and-test approach, everything works
columns = ['text']
selector = ColumnSelector(columns)
a = selector.transform(df)
exclamator = ExclamationTransformer(exclamation=1)
b = exclamator.transform(a)
X_train, X_test, y_train, y_test = train_test_split(b, y, test_size=0.3,random_state=1, stratify=y)
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print("Accuracy for exclamations: ",accuracy_score(y_test, predictions))
# Output: Accuracy for exclamations: 0.25
pipeline = Pipeline([
    ('text_extraction', ColumnSelector(columns)),
    ('exclamations', ExclamationTransformer(exclamation=1)),
    ('classifier', svm.SVC(kernel='linear'))
])
# When it comes to this part I get the warning and the error listed below
scores_pipe = cross_val_score(pipeline, df, y, scoring='accuracy', cv=2)
mean_pipe_score = scores_pipe.mean()
print("Accuracy for exclamations:", mean_pipe_score)
#Output: Accuracy for exclamations: nan
Warning message:
FutureWarning: From version 0.24, get_params will raise an AttributeError if a parameter cannot be retrieved as an instance attribute. Previously it would return None.
FutureWarning)
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
KeyError: None
FitFailedWarning)
I have spent hours on this but still have no idea what is wrong, or how to feed a custom feature to a pipeline, let alone several custom features combined with standard vectorizers. Does anyone know why this happens or how to fix it?
Upvotes: 0
Views: 115
Reputation: 1
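In a class derived from BaseEstimator, store each __init__ parameter on an instance attribute with exactly the same name as the parameter. cross_val_score clones the pipeline for every fold, and clone() rebuilds each step from get_params(), which looks up every __init__ parameter by name on the instance. With self._feature_names there is no attribute called feature_names, so get_params() returns None (hence the FutureWarning), the cloned selector is created with feature_names=None, and X[None] raises KeyError: None. The same applies to self._exclamation in ExclamationTransformer. For example: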
class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[self.feature_names]
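As for combining a custom feature with a standard vectorizer, one possible approach is FeatureUnion, which runs several transformers on the same input and concatenates their outputs column-wise. The following is only a minimal sketch under assumptions: CharFrequency is a hypothetical transformer written for this illustration (it is not part of scikit-learn), it takes raw strings instead of a DataFrame column, and the toy texts and labels are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC


class CharFrequency(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: relative frequency of one character per text."""

    def __init__(self, char="!"):
        # Same name as the __init__ parameter, so get_params() and clone()
        # can read the value back.
        self.char = char

    def fit(self, X, y=None):
        # Nothing to learn from the training data.
        return self

    def transform(self, X):
        # One row per text, one column with the frequency of self.char.
        return np.array([[t.count(self.char) / max(len(t), 1)] for t in X])


# clone() rebuilds the estimator from get_params(); the parameter survives.
assert clone(CharFrequency(char="?")).char == "?"

# Made-up toy data: two authors, four texts each.
texts = [
    "What a day! The beach was great!",
    "I love surfing! The waves were huge!",
    "Great food here! You must try it!",
    "The festival was amazing! So much music!",
    "The trail is flat and quiet in the morning.",
    "We walked along the river for several hours.",
    "The mountains were covered in snow that day.",
    "A long hike took us above the treeline.",
]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

# Both transformers receive the raw texts; their outputs are stacked side by side.
features = FeatureUnion([
    ("tf_idf", TfidfVectorizer()),
    ("exclamations", CharFrequency(char="!")),
])

pipeline = Pipeline([
    ("features", features),
    ("classifier", SVC(kernel="linear")),
])

scores = cross_val_score(pipeline, texts, labels, scoring="accuracy", cv=2)
print("Mean accuracy:", scores.mean())
```

Further custom features could be added as extra entries in the FeatureUnion list. Since the columns may end up on very different scales, a scaling step before the SVM might be worth trying.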
Upvotes: 0