Saad

Reputation: 1360

scikit-learn: FeatureUnion to include hand crafted features

I am performing multi-label classification on text data. I want to combine tf-idf features with custom linguistic features using FeatureUnion, similar to the example here.

I have already generated the custom linguistic features, which take the form of a dictionary whose keys are the labels and whose values are lists of feature phrases:

custom_features_dict = {'contact':['contact details', 'e-mail'],
                        'demographic':['gender', 'age', 'birth'],
                        'location':['location', 'geo']}

Training data structure is as follows:

text                                            contact  demographic  location
---                                              ---      ---          ---
'provide us with your date of birth and e-mail'  1        1            0
'contact details and location will be stored'    1        0            1
'date of birth should be before 2004'            0        1            0

How can the above dict be incorporated into FeatureUnion? My understanding is that I need a user-defined function that returns boolean values indicating whether the strings from custom_features_dict occur in the training data.

This gives the following list of dicts for the given training data:

[
    {
       'contact':1,
       'demographic':1,
       'location':0
    },
    {
       'contact':1,
       'demographic':0,
       'location':1
    },
    {
       'contact':0,
       'demographic':1,
       'location':0
    },
] 

How can the above list be used to implement fit and transform?

The code is given below:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
#from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline, FeatureUnion
from io import StringIO

data = StringIO(u'''text,contact,demographic,location
provide us with your date of birth and e-mail,1,1,0
contact details and location will be stored,1,0,1
date of birth should be before 2004,0,1,0''')

df = pd.read_csv(data)

custom_features_dict = {'contact':['contact details', 'e-mail'], 
                        'demographic':['gender', 'age', 'birth'],
                        'location':['location', 'geo']}

my_features = [
    {
       'contact':1,
       'demographic':1,
       'location':0
    },
    {
       'contact':1,
       'demographic':0,
       'location':1
    },
    {
       'contact':0,
       'demographic':1,
       'location':0
    },
]

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        # This needs to be fixed
        ("custom_features", my_features),
        ("dict_vect", DictVectorizer()),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
            ('combined_features', combined_features),
            ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
        ]
)

labels = ['contact', 'demographic', 'location']

for label in labels:
    final_pipeline.fit(df['text'], df[label]) 

Upvotes: 1

Views: 498

Answers (2)

chefhose

Reputation: 2694

You have to define a transformer which takes your text as input, something like this:

from sklearn.base import BaseEstimator, TransformerMixin

custom_features_dict = {'contact':['contact details', 'e-mail'],
                        'demographic':['gender', 'age', 'birth'],
                        'location':['location', 'geo']}

# helper function which returns 1 if one of the words occurs in the text, else 0
# you can add more words or categories to custom_features_dict if you want
def is_words_present(text, listofwords):
    for word in listofwords:
        if word in text:
            return 1
    return 0
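
For example, is_words_present('provide us with your date of birth and e-mail', custom_features_dict['demographic']) returns 1, because 'birth' occurs in the text. Note that word in text is a substring check, so multi-word entries such as 'contact details' are matched as well.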

class CustomFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, custom_feature_dict):
        self.custom_feature_dict = custom_feature_dict

    def fit(self, x, y=None):
        return self

    def transform(self, data):
        result_arr = []
        for text in data:
            arr = []
            for key in self.custom_feature_dict:
                arr.append(is_words_present(text, self.custom_feature_dict[key]))
            result_arr.append(arr)
        return result_arr

Note: this transformer directly generates rows like [1, 0, 1] instead of dictionaries, which lets us skip the DictVectorizer.
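
If you would rather keep the DictVectorizer step from the question (for example, to read feature names off the union later), a variant that returns a list of dicts instead could look like the sketch below; the name DictFeatureTransformer is illustrative, not part of the original answer:

from sklearn.base import BaseEstimator, TransformerMixin

class DictFeatureTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, custom_feature_dict):
        self.custom_feature_dict = custom_feature_dict

    def fit(self, x, y=None):
        return self

    def transform(self, data):
        # one dict per text, e.g. {'contact': 1, 'demographic': 1, 'location': 0};
        # a following DictVectorizer step turns these into a feature matrix
        return [{key: is_words_present(text, words)
                 for key, words in self.custom_feature_dict.items()}
                for text in data]

The manual_pipeline from the question would then keep both of its steps: ("custom_features", DictFeatureTransformer(custom_features_dict)) followed by ("dict_vect", DictVectorizer()).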

Additionally, I changed how the multi-label classification is processed, as shown below:

# first, generate a new column in the dataframe with all the labels per row:
def create_textlabels_array(row):
    arr = []
    for label in ['contact', 'demographic', 'location']:
        if row[label] == 1:
            arr.append(label)
    return arr

df['textlabels'] = df.apply(create_textlabels_array, axis=1)

# then we generate the binarized labels:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer().fit(df['textlabels'])
y = mlb.transform(df['textlabels'])
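
As a quick sanity check, with the three sample rows above (and the CSV labels matching the table in the question), the binarized labels come out as:

print(mlb.classes_)  # ['contact' 'demographic' 'location']
print(y)
# [[1 1 0]
#  [1 0 1]
#  [0 1 0]]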

Now we can add all together to the pipeline:

bow_pipeline = Pipeline(
    steps=[
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
    ]
)

manual_pipeline = Pipeline(
    steps=[
        ("costum_vect", CustomFeatureTransformer(custom_features_dict)),
    ]
)

combined_features = FeatureUnion(
    transformer_list=[
        ("bow", bow_pipeline),
        ("manual", manual_pipeline),
    ]
)

final_pipeline = Pipeline([
        ('combined_features', combined_features),
        ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
    ]
)

# train the pipeline
final_pipeline.fit(df['text'], y)

# let's predict something (note: of course, the training data is very small in this example)
pred = final_pipeline.predict(["write an e-mail to our location please"])
print(pred) # output: [0, 1, 1]

# map the predicted array back to the actual labels:
print(mlb.inverse_transform(pred)) # output: [('demographic', 'location')]

Upvotes: 1

user3357359

Reputation: 169

If we just want to fix the part of the code marked to be fixed, all we need is to implement a new estimator extending the class sklearn.base.BaseEstimator (the TemplateClassifier class is a good example here).

However, it seems there is a conceptual mistake here. The information in the list my_features seems to be the labels themselves (well, one could argue that they are very strong features...). So we should not put the labels into the features pipeline.

As described here,

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).

That said, if you still want to put that list information into a transform method, it would look something like this:

def transform_str(one_line_text: str) -> dict:
    """Transforms one line of text into dict features using manually extracted information."""
    # manually extracted information
    custom_features_dict = {'contact': ['contact details', 'e-mail'],
                            'demographic': ['gender', 'age', 'birth'],
                            'location': ['location', 'geo']}
    # simple whitespace tokenization; note that multi-word phrases
    # such as 'contact details' will never match a single token
    tokenized_text = one_line_text.split(" ")
    output = dict()
    for feature, tokens in custom_features_dict.items():
        output[feature] = False
        for word in tokenized_text:
            if word in tokens:
                output[feature] = True
    return output

def transform(text_list: list) -> list:
    output = list()
    for one_line_text in text_list:
        output.append(transform_str(one_line_text))
    return output

In this case, you don't need a fit method, because the fitting was done manually.
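
For reference, one way to wire these functions into the question's manual_pipeline is scikit-learn's FunctionTransformer, which wraps a stateless function as a transformer; the wiring below is a sketch under that assumption, not part of the original answer:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction import DictVectorizer

manual_pipeline = Pipeline(
    steps=[
        # apply the stateless transform() defined above to the raw texts
        ("custom_features", FunctionTransformer(transform)),
        # turn the resulting list of dicts into a feature matrix
        ("dict_vect", DictVectorizer()),
    ]
)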

Upvotes: 0
