ℕʘʘḆḽḘ

Reputation: 19375

How to merge two CountVectorizers when there are duplicates?

Consider this simple example

data = pd.DataFrame({'text1' : ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
    
data
Out[489]: 
            text1            text2
0     hello world     good morning
1  hello universe  hello two three

As you can see, text1 and text2 share one exact word in common: hello. I am trying to create ngrams separately for text1 and text2, and I want to concatenate the results together into one CountVectorizer object.

The idea is that I want to create ngrams separately for the two variables and use them as features in a ML algo. However, I do not want the extra ngrams that would be created by concatenating the strings together, like world good in hello world good morning. This is why I keep the ngram creation separate.

The issue is that by doing so, the resulting (sparse) vector will contain a duplicated hello column.

See here:

vector = CountVectorizer(ngram_range=(1, 2))

v1 = vector.fit_transform(data.text1.values) 
print(vector.get_feature_names())

['hello', 'hello universe', 'hello world', 'universe', 'world']

v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names())

['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']

And now concatenating v1 and v2 gives 13 columns

from scipy.sparse import hstack
print(hstack((v1, v2)).toarray())

[[1 0 1 0 1 1 1 0 0 1 0 0 0]
 [1 1 0 1 0 0 0 1 1 0 1 1 1]]

The proper text features should be 12:

hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three

What can I do here to have the proper unique words as features? Thanks!

Upvotes: 2

Views: 404

Answers (2)

Antoine Dubuis

Reputation: 5304

I think the best way to tackle this problem is to create a custom Transformer that uses a CountVectorizer.

I would do it as follows:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # flatten all textual columns so each cell is its own document
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # join all the textual columns of a row into one document
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        return self.vectorizer.get_feature_names()


transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
transformer.get_feature_names()

The fit() method fits the CountVectorizer with each column treated independently, while transform() treats a row's columns as one single line of text.

np.reshape(X.values, (-1,)) transforms a matrix of shape (N, n_columns) into a one-dimensional array of shape (N*n_columns,). This ensures that each text field is treated independently during fit(). After that, the transformation is applied to all the text features of a sample by joining them together.
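As a quick sanity check on the example frame from the question, the reshape flattens the frame row by row (C order), so every cell becomes its own document:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})

# (2, 2) frame -> 4 independent documents, in row-major order
flat = np.reshape(data.values, (-1,))
print(list(flat))
# ['hello world', 'good morning', 'hello universe', 'hello two three']
```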

This custom Transformer returns the desired 12 features:

['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

and the following feature matrix:

[[1 1 1 0 0 1 1 0 0 0 0 1]
 [0 0 2 1 1 0 0 1 1 1 1 0]]

NOTE: this custom transformer assumes that X is a pd.DataFrame with n textual columns.

EDIT: the textual fields need to be joined with a space during transform().
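Because the transformer follows the fit/transform API, it can also be dropped into a Pipeline. A minimal sketch (the LogisticRegression step and the toy labels are illustrative assumptions, not part of the answer):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # fit on each textual cell as its own document
        self.vectorizer.fit(np.reshape(X.values, (-1,)))
        return self

    def transform(self, X, y=None):
        # join a row's columns with a space before counting
        return self.vectorizer.transform(X.apply(' '.join, axis=1))

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
y = [0, 1]  # toy labels, purely illustrative

pipe = Pipeline([('vec', MultiRowsCountVectorizer()),
                 ('clf', LogisticRegression())])
pipe.fit(data, y)
print(pipe.predict(data))
```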

Upvotes: 2

Ric S

Reputation: 9247

Disclaimer: this answer might not be very sophisticated, but if I understood your problem correctly it should do its job.

# create an additional column by chaining the two text columns with a fake word
data['text3'] = data['text1'] + ' xxxxxxxxxx ' + data['text2']
print(data)
#             text1            text2                                      text3
# 0     hello world     good morning        hello world xxxxxxxxxx good morning
# 1  hello universe  hello two three  hello universe xxxxxxxxxx hello two three

# instantiate CountVectorizer and fit it
vector = CountVectorizer(ngram_range=(1, 2))
v3 = vector.fit_transform(data.text3.values)

# have a look at the resulting column names
all_colnames = vector.get_feature_names()
print(all_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'universe xxxxxxxxxx', 'world', 'world xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxxxxxxx good', 'xxxxxxxxxx hello']

# select only column names of interest
correct_colnames = [e for e in vector.get_feature_names() if 'xxxxxxxxxx' not in e]
print(correct_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']

print(len(all_colnames))
# 17
print(len(correct_colnames))
# 12   # the desired length

# select only the array columns where the fake word is absent
arr = v3.toarray()[:, ['xxxxxxxxxx' not in e for e in all_colnames]]
print(arr.shape)
print(arr)
# (2, 12)
# [[1 1 1 0 0 1 1 0 0 0 0 1]
#  [0 0 2 1 1 0 0 1 1 1 1 0]]

# if you need a pandas.DataFrame as result
new_df = pd.DataFrame(arr, columns=correct_colnames)
print(new_df)
#    good  good morning  hello  hello two  hello universe  hello world  morning  three  two  two three  universe  world
# 0     1             1      1          0               0            1        1      0    0          0         0      1
# 1     0             0      2          1               1            0        0      1    1          1         1      0

The rationale behind this is: we insert a fake word, like 'xxxxxxxxxx', which is close to impossible to encounter in a real text string. The algorithm will treat it as a real word and therefore create 1-grams and 2-grams with it.

However, we can eliminate those n-grams afterwards, and shared words (like 'hello' in this case) are no longer counted separately for the two text columns: in the resulting dataframe, the word 'hello' has a count of 2 in the second row rather than appearing in two duplicated columns.

Upvotes: 2
