Reputation: 843
I want to build a text classifier with sklearn and then convert it to an iOS 11 Core ML model file using the coremltools package. I've built three different classifiers with Logistic Regression, Random Forest, and Linear SVC, and all of them work fine in Python. The problem is the coremltools package and the way it converts an sklearn model to a Core ML file. According to its documentation, it supports only a fixed set of sklearn models and transformers.
So it doesn't let me include the step that vectorizes my text dataset (I've used TfidfVectorizer in my classifiers):
import coremltools
coreml_model = coremltools.converters.sklearn.convert(model, input_features='text', output_feature_names='category')
Traceback (most recent call last):
File "<ipython-input-3-97beddbdad10>", line 1, in <module>
coreml_model = coremltools.converters.sklearn.convert(pipeline, input_features='Message', output_feature_names='Label')
File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter.py", line 146, in convert
sk_obj, input_features, output_feature_names, class_labels = None)
File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter_internal.py", line 147, in _convert_sklearn_model
for sk_obj_name, sk_obj in sk_obj_list]
File "/usr/local/lib/python2.7/dist-packages/coremltools/converters/sklearn/_converter_internal.py", line 97, in _get_converter_module
",".join(k.__name__ for k in _converter_module_list)))
ValueError: Transformer 'TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=3,
ngram_range=(1, 2), norm=u'l2', preprocessor=None, smooth_idf=1,
stop_words='english', strip_accents='unicode', sublinear_tf=1,
token_pattern='\\w+', tokenizer=None, use_idf=1, vocabulary=None)' not supported;
supported transformers are coremltools.converters.sklearn._dict_vectorizer,coremltools.converters.sklearn._one_hot_encoder,coremltools.converters.sklearn._normalizer,coremltools.converters.sklearn._standard_scaler,coremltools.converters.sklearn._imputer,coremltools.converters.sklearn._NuSVC,coremltools.converters.sklearn._NuSVR,coremltools.converters.sklearn._SVC,coremltools.converters.sklearn._SVR,coremltools.converters.sklearn._linear_regression,coremltools.converters.sklearn._LinearSVC,coremltools.converters.sklearn._LinearSVR,coremltools.converters.sklearn._logistic_regression,coremltools.converters.sklearn._random_forest_classifier,coremltools.converters.sklearn._random_forest_regressor,coremltools.converters.sklearn._decision_tree_classifier,coremltools.converters.sklearn._decision_tree_regressor,coremltools.converters.sklearn._gradient_boosting_classifier,coremltools.converters.sklearn._gradient_boosting_regressor.
Is there any way to build an sklearn text classifier without using TfidfVectorizer or CountVectorizer?
Upvotes: 0
Views: 586
Reputation: 21
Right now you can't include a tf-idf vectorizer in your pipeline if you want to convert it to the .mlmodel format. The way around this is to vectorize your data separately and then train the model (Linear SVC, Random Forest, ...) on the vectorized data. You then need to compute the tf-idf representation on the device and feed it to the model. Here's a copy of the tf-idf function I wrote:
import CoreML
import Foundation

// Builds a tf-idf vector for `document`.
// words_ordered.txt: one vocabulary term per line, in the model's feature order.
// data.txt: the training corpus, one document per line.
func tfidf(document: String) -> MLMultiArray {
    let wordsFile = Bundle.main.path(forResource: "words_ordered", ofType: "txt")
    let dataFile = Bundle.main.path(forResource: "data", ofType: "txt")
    do {
        let wordsFileText = try String(contentsOfFile: wordsFile!, encoding: .utf8)
        let wordsData = wordsFileText.components(separatedBy: .newlines)
        let dataFileText = try String(contentsOfFile: dataFile!, encoding: .utf8)
        let data = dataFileText.components(separatedBy: .newlines)
        let wordsInMessage = document.split(separator: " ")
        let vectorized = try MLMultiArray(shape: [NSNumber(value: wordsData.count)], dataType: .double)
        for i in 0..<wordsData.count {
            let word = wordsData[i]
            if document.contains(word) {
                // Term frequency: exact token matches divided by the token count.
                var wordCount = 0
                for substr in wordsInMessage {
                    if substr.elementsEqual(word) {
                        wordCount += 1
                    }
                }
                let tf = Double(wordCount) / Double(wordsInMessage.count)
                // Document frequency: training documents that contain the term.
                var docCount = 0
                for line in data {
                    if line.contains(word) {
                        docCount += 1
                    }
                }
                let idf = log(Double(data.count) / Double(docCount))
                vectorized[i] = NSNumber(value: tf * idf)
            } else {
                vectorized[i] = 0.0
            }
        }
        return vectorized
    } catch {
        // Reading a file or allocating the array failed.
        return MLMultiArray()
    }
}
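The converted model only sees numbers, so the features it is trained on should be computed the same way as the ones produced on the device. As a rough sketch (Python 3, standard library only), the Swift function above can be mirrored in Python to build the training vectors or to spot-check on-device output. The name `tfidf_vector` and the sample data are made up for illustration, and the guard for a zero document count is an assumption that the Swift code does not have. Note that sklearn's TfidfVectorizer smooths the idf and L2-normalizes by default, so its output would not match this plain tf * log(N / df) formula.

```python
import math


def tfidf_vector(document, vocabulary, corpus):
    """Plain tf-idf, mirroring the Swift tfidf(document:) function.

    vocabulary: terms in the model's feature order (words_ordered.txt).
    corpus: training documents, one string per document (data.txt).
    """
    # Swift's split(separator:) drops empty pieces, so do the same here.
    tokens = [t for t in document.split(" ") if t]
    vector = []
    for word in vocabulary:
        # Substring check, like String.contains in the Swift version.
        if word not in document:
            vector.append(0.0)
            continue
        tf = tokens.count(word) / len(tokens)
        doc_count = sum(1 for line in corpus if word in line)
        if doc_count == 0:
            # Assumption: treat a term missing from the corpus as zero weight.
            vector.append(0.0)
            continue
        idf = math.log(len(corpus) / doc_count)
        vector.append(tf * idf)
    return vector


# Hypothetical sample data, for illustration only.
vocabulary = ["free", "hello", "win"]
corpus = [
    "free prize win now",
    "hello there",
    "win a free phone",
    "hello again friend",
]
vec = tfidf_vector("win a free prize", vocabulary, corpus)
print(vec)
```

Vectors built this way can be stacked into the training matrix for LinearSVC or a random forest before converting the fitted model with coremltools, which keeps the Python and Swift sides on the same formula.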
Edit: Wrote up a whole post on how to do this at http://gokulswamy.me/imessage-spam-detection/.
Upvotes: 1