ℕʘʘḆḽḘ

Reputation: 19375

How to merge two CountVectorizers when there are duplicates?

Consider this simple example

data = pd.DataFrame({'text1' : ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
    
data
Out[489]: 
            text1            text2
0     hello world     good morning
1  hello universe  hello two three

As you can see, text1 and text2 share one exact word in common: hello. I am trying to create ngrams separately for text1 and text2, and I want to concatenate the results together into one CountVectorizer object.

The idea is that I want to create ngrams separately for the two variables and use them as features in a ML algo. However, I do not want the extra ngrams that would be created by concatenating the strings together, like world good in hello world good morning. This is why I keep the ngram creation separate.

The issue is that by doing so, the resulting (sparse) vector will contain a duplicated hello column.

See here:

vector = CountVectorizer(ngram_range=(1, 2))

v1 = vector.fit_transform(data.text1.values) 
print(vector.get_feature_names())

['hello', 'hello universe', 'hello world', 'universe', 'world']

v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names())

['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']

And now concatenating v1 and v2 gives 13 columns

from scipy.sparse import hstack
print(hstack((v1, v2)).toarray())

[[1 0 1 0 1 1 1 0 0 1 0 0 0]
 [1 1 0 1 0 0 0 1 1 0 1 1 1]]

The proper text features should be 12:

hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three

What can I do here to have the proper unique words as features? Thanks!

Upvotes: 2

Views: 404

Answers (2)

Antoine Dubuis

Reputation: 5304

I think the best way to tackle this problem is to create a custom Transformer that uses a CountVectorizer.

I would do it as follows:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # flatten all textual columns so each cell is its own document
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # join all the textual columns of a row into one document
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        return self.vectorizer.get_feature_names()


transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
transformer.get_feature_names()

The fit() method fits the CountVectorizer with each column treated independently, while transform() treats a row's columns as one single line of text.

np.reshape(X.values, (-1,)) transforms a matrix of shape (N, n_columns) into a one-dimensional array of shape (N*n_columns,). This ensures that each text field is treated independently during fit(). After that, the transformation is applied to all the text features of a sample by joining them together.
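As a quick sanity check on the example frame from the question, the reshape flattens the frame row by row (C order), so every cell becomes its own document:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# (2, 2) frame -> 4 independent documents, in row-major order
flat = np.reshape(data.values, (-1,))
print(list(flat))
# ['hello world', 'good morning', 'hello universe', 'hello two three']
```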

This custom Transformer returns the desired 12 features:

['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

and the following feature matrix:

[[1 1 1 0 0 1 1 0 0 0 0 1]
 [0 0 2 1 1 0 0 1 1 1 1 0]]

NOTE: this custom transformer assumes that X is a pd.DataFrame with n textual columns.

EDIT: the textual fields need to be joined with a space during transform().
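Because the transformer follows the fit/transform API, it can also be dropped into a Pipeline. A minimal sketch (the LogisticRegression step and the toy labels are illustrative assumptions, not part of the answer):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # fit on each textual cell as its own document
        self.vectorizer.fit(np.reshape(X.values, (-1,)))
        return self

    def transform(self, X, y=None):
        # join a row's columns with a space before counting
        return self.vectorizer.transform(X.apply(' '.join, axis=1))

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
y = [0, 1]  # toy labels, purely illustrative

pipe = Pipeline([('vec', MultiRowsCountVectorizer()),
                 ('clf', LogisticRegression())])
pipe.fit(data, y)
print(pipe.predict(data))
```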

Upvotes: 2

Ric S

Reputation: 9247

Disclaimer: this answer might not be very sophisticated, but if I understood your problem correctly it should do its job.

# create an additional column by chaining the two text columns with a fake word
data['text3'] = data['text1'] + ' xxxxxxxxxx ' + data['text2']
print(data)
#             text1            text2                                      text3
# 0     hello world     good morning        hello world xxxxxxxxxx good morning
# 1  hello universe  hello two three  hello universe xxxxxxxxxx hello two three

# instantiate CountVectorizer and fit it
vector = CountVectorizer(ngram_range=(1, 2))
v3 = vector.fit_transform(data.text3.values)

# have a look at the resulting column names
all_colnames = vector.get_feature_names()
print(all_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'universe xxxxxxxxxx', 'world', 'world xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxxxxxxx good', 'xxxxxxxxxx hello']

# select only column names of interest
correct_colnames = [e for e in vector.get_feature_names() if 'xxxxxxxxxx' not in e]
print(correct_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

print(len(all_colnames))
# 17
print(len(correct_colnames))
# 12   # the desired length

# select only the array columns where the fake word is absent
arr = v3.toarray()[:, ['xxxxxxxxxx' not in e for e in all_colnames]]
print(arr.shape)
print(arr)
# (2, 12)
# [[1 1 1 0 0 1 1 0 0 0 0 1]
#  [0 0 2 1 1 0 0 1 1 1 1 0]]

# if you need a pandas.DataFrame as result
new_df = pd.DataFrame(arr, columns=correct_colnames)
print(new_df)
#    good  good morning  hello  hello two  hello universe  hello world  morning  three  two  two three  universe  world
# 0     1             1      1          0               0            1        1      0    0          0         0      1
# 1     0             0      2          1               1            0        0      1    1          1         1      0

The rationale behind this is: we insert a fake word, like 'xxxxxxxxxx', which is close to impossible to encounter in a real text string. The algorithm will treat it as a real word and therefore create 1-grams and 2-grams with it.

However, we can eliminate those n-grams afterwards, and shared words (like 'hello' in this case) are no longer counted separately for the two text columns: in the resulting dataframe, the word 'hello' has a count of 2 in the second row rather than appearing in two duplicated columns.

Upvotes: 2
