Reputation: 157
I am writing custom transformers in scikit-learn to perform specific operations on an array. To do that, I inherit from the TransformerMixin class. This works fine with a single transformer. However, when I try to combine several of them using FeatureUnion (or make_union), the original columns of the array are replicated n times, once per transformer. What can I do to avoid that? Am I using scikit-learn the way it is supposed to be used?
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.pipeline import FeatureUnion
# creation of array
s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])
print('base array: \n', X, '\n')
# A fake example that appends a column (Could be a score, ...) calculated on specific columns from X
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value
    def fit(self, *_):
        return self
    def transform(self, X):
        # appends a column (in this case, a constant) to X
        s = np.full(X.shape[0], self.value)
        X = np.column_stack([X, s])
        return X
# as such, the transformer gives what I need first
transfo = DummyTransformer(value=1)
print('single transformer: \n', transfo.fit_transform(X), '\n')
# but when I try to chain them and create a pipeline I run into the replication of existing columns
stages = []
for i in range(2):
    transfo = DummyTransformer(value=i + 1)
    stages.append(('step' + str(i + 1), transfo))
pipeunion = FeatureUnion(stages)
print('Given result of the Feature union pipeline: \n', pipeunion.fit_transform(X), '\n')
# columns 1&2 from X are replicated
# I would expect:
expected = np.column_stack([X, np.full(X.shape[0], 1), np.full(X.shape[0], 2) ])
print('Expected result of the Feature Union pipeline: \n', expected, '\n')
Output:
base array:
[['foo' 'a']
['bar' 'b']
['baz' 'c']]
single transformer:
[['foo' 'a' '1']
['bar' 'b' '1']
['baz' 'c' '1']]
Given result of the Feature union pipeline:
[['foo' 'a' '1' 'foo' 'a' '2']
['bar' 'b' '1' 'bar' 'b' '2']
['baz' 'c' '1' 'baz' 'c' '2']]
Expected result of the Feature Union pipeline:
[['foo' 'a' '1' '2']
['bar' 'b' '1' '2']
['baz' 'c' '1' '2']]
Many thanks
Upvotes: 0
Views: 1143
Reputation: 36619
FeatureUnion will just concatenate whatever it gets from its internal transformers. In your case, each internal transformer returns the same original columns along with its new one, so those columns appear once per transformer. It is up to each transformer to pass forward only the data it should.
I would advise you to return only the new data from the internal transformers, and then concatenate the remaining columns either from outside or inside the FeatureUnion.
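To make the concatenation behaviour concrete, here is a minimal sketch. ColumnAppender is a hypothetical transformer written for this illustration (it is not from the question), and deriving it from BaseEstimator as well as TransformerMixin is my own assumption to keep it compatible with recent scikit-learn releases. It suggests that the union output is simply the side-by-side stack of what each transformer returns:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class ColumnAppender(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: returns X plus one constant column."""

    def __init__(self, value=0):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.column_stack([X, np.full(X.shape[0], self.value)])


X = np.array([[10, 20], [30, 40]])
a = ColumnAppender(value=1)
b = ColumnAppender(value=2)

union_out = FeatureUnion([("a", a), ("b", b)]).fit_transform(X)
manual_out = np.hstack([a.fit_transform(X), b.fit_transform(X)])

# Each transformer returns 3 columns (2 original + 1 new), so the union
# has 6 columns and the original two show up once per transformer.
print(union_out.shape)
print(np.array_equal(union_out, manual_out))
```

Because nothing is deduplicated, any column that every transformer returns will be repeated in the result.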
Look at this example if you haven't already:
You can do this for example:
# This doesn't do anything; it just passes the data through unchanged
class DataPasser(TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X
# Your transformer
class DummyTransformer(TransformerMixin):
    def __init__(self, value=None):
        TransformerMixin.__init__(self)
        self.value = value
    def fit(self, *_):
        return self
    # Changed this to return only the new column, as a 2D array of shape (n_samples, 1)
    def transform(self, X):
        s = np.full(X.shape[0], self.value)
        return s.reshape(-1, 1)
After that, further down in your code, change this:
stages = []
# Append our DataPasser here, so original data is at the beginning
stages.append(('no_change', DataPasser()))
for i in range(2):
    transfo = DummyTransformer(value=i + 1)
    stages.append(('step' + str(i + 1), transfo))
pipeunion = FeatureUnion(stages)
Running this new code gives the following result:
('Given result of the Feature union pipeline: \n',
 array([['foo', 'a', '1', '2'],
        ['bar', 'b', '1', '2'],
        ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
('Expected result of the Feature Union pipeline: \n',
 array([['foo', 'a', '1', '2'],
        ['bar', 'b', '1', '2'],
        ['baz', 'c', '1', '2']], dtype='|S21'), '\n')
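The other option mentioned above, concatenating the original columns from outside the FeatureUnion, could look roughly like the sketch below. NewColumnOnly is a hypothetical stand-in for the modified DummyTransformer, and inheriting from BaseEstimator is my own assumption for compatibility with recent scikit-learn versions. No pass-through transformer is needed here, because the input is stacked back on with NumPy:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class NewColumnOnly(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for DummyTransformer: returns only the new column."""

    def __init__(self, value=None):
        self.value = value

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Shape (n_samples, 1) so the union can stack it as a column
        return np.full((X.shape[0], 1), self.value)


s1 = np.array(['foo', 'bar', 'baz'])
s2 = np.array(['a', 'b', 'c'])
X = np.column_stack([s1, s2])

# The union contains only the transformers that create new columns
stages = [('step' + str(i + 1), NewColumnOnly(value=i + 1)) for i in range(2)]
new_columns = FeatureUnion(stages).fit_transform(X)

# Put the untouched input back in front, outside the union
result = np.hstack([X, new_columns])
print(result)
```

The trade-off is that the final concatenation lives outside the scikit-learn object, so it is not part of anything you would later pass to a Pipeline or a grid search. The DataPasser approach keeps everything inside the FeatureUnion.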
Upvotes: 2