Reputation: 249
I'm doing binary classification over some documents whose features are already extracted and given in a text file. My problem is that there are both textual features and numerical features, like years and some others. One sample is given in this format:
label |title text |otherText text |numFeature1 number |numFeature2 number
I'm following the documentation about feature unions, but its use case is a bit different: I do not extract the features from another feature, because these numerical features are already given.
Currently I'm using the following setup:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([
    ('features', Features()),
    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('otherText', Pipeline([
                ('selector', ItemSelector(key='otherText')),
                ('tfidf', TfidfVectorizer()),
            ])),
            ('numFeature1', Pipeline([
                ('selector', ItemSelector(key='numFeature1')),
            ])),
            ('numFeature2', Pipeline([
                ('selector', ItemSelector(key='numFeature2')),
            ])),
        ],
    )),
    ('classifier', MultinomialNB()),
])
The Features class is also adapted from the documentation:
import re
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Features(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        features = np.recarray(shape=(len(posts),),
                               dtype=[('title', object), ('otherText', object),
                                      ('numFeature1', object), ('numFeature2', object)])
        for i, text in enumerate(posts):
            # split on the "|fieldName" markers; l[0] is the label
            l = re.split(r"\|\w+", text)
            features['title'][i] = l[1]
            features['otherText'][i] = l[2]
            features['numFeature1'][i] = l[3]
            features['numFeature2'][i] = l[4]
        return features
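For illustration, this is what the split produces on a made-up line in that format (the values are invented, only the field markers matter):

import re

line = "pos |title Some Title |otherText more text |numFeature1 1999 |numFeature2 3"
print(re.split(r"\|\w+", line))
# ['pos ', ' Some Title ', ' more text ', ' 1999 ', ' 3']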
My problem now is: how do I add the numerical features to the FeatureUnion? When I use a CountVectorizer on them I get "ValueError: empty vocabulary; perhaps the documents only contain stop words", and using a DictVectorizer with only one entry doesn't strike me as the way to go.
Upvotes: 2
Views: 1877
Reputation: 36609
What ItemSelector() does is pick data from the given dict (X) according to the key supplied in the constructor and return a one-dimensional [n,] array.

This type of [n,] array is not handled correctly by the FeatureUnion. FeatureUnion requires a two-dimensional array from each of its internal transformers, in which the first dimension (the number of samples) must be consistent, so that the outputs can finally be stacked horizontally to combine the features.

The second step in your first two transformers (TfidfVectorizer()) takes this [n,] array from the ItemSelector and outputs a valid [n, m] array, where m is the number of features extracted from the raw text.

But your 3rd and 4th transformers contain only an ItemSelector(), so they output a [n,] array. That is the cause of the errors.
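To illustrate the shapes involved, here is a small sketch with made-up values (tfidf_out and num_out just stand in for the two kinds of branch outputs):

import numpy as np

# Stand-ins for the two kinds of branch outputs, only the shapes matter:
tfidf_out = np.zeros((3, 5))                                 # TfidfVectorizer output: (n, m)
num_out = np.array(['1999', '2004', '2011'], dtype=object)   # bare ItemSelector output: (n,)

print(tfidf_out.shape, num_out.shape)    # (3, 5) (3,)
print(num_out.reshape((-1, 1)).shape)    # (3, 1) -- a 2-D column that can be stacked next to (n, m)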
To correct this you should reshape the output of ItemSelector to [n, 1].

Change the following code in ItemSelector.transform() (I'm assuming you are using the ItemSelector code from the link you specified):

Original:

data_dict[self.key]

New:

data_dict[self.key].reshape((-1, 1))

The reshape() will format your [n,] array as [n, 1], which can then be stacked by the FeatureUnion to combine the data correctly.
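Putting it together, a minimal sketch of ItemSelector with that change applied (assuming the rest of the class matches the documentation example you linked):

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        # return a 2-D [n, 1] column instead of a 1-D [n,] array
        return data_dict[self.key].reshape((-1, 1))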
Upvotes: 0
Reputation: 9081
The TfidfVectorizer() object has not been fitted with data yet. Before constructing the pipeline, do this:
vec = TfidfVectorizer()
vec.fit(data['free text column'])

pipeline = Pipeline([
    ('features', Features()),
    ('union', FeatureUnion(
        transformer_list=[
            ('title', Pipeline([
                ('selector', ItemSelector(key='title')),
                ('tfidf', vec),
            ])),
            # ... other features
This helps if you want to run the pipeline on test data as well, because for test data the pipeline automatically calls transform() on the TfidfVectorizer instead of fit(), which you have to do explicitly before constructing the pipeline.
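For example, a hypothetical usage (train_lines, train_labels and test_lines are placeholders for your own data split):

# fit the whole pipeline on training lines, then predict on test lines
pipeline.fit(train_lines, train_labels)
predicted = pipeline.predict(test_lines)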
Upvotes: 0