Reputation: 19375
Consider this simple example:

import pandas as pd

data = pd.DataFrame({'text1': ['hello world', 'hello universe'],
                     'text2': ['good morning', 'hello two three']})
data
Out[489]:
            text1            text2
0     hello world     good morning
1  hello universe  hello two three
As you can see, text1 and text2 share one exact word: hello. I am trying to create ngrams separately for text1 and text2, and I want to concatenate the results together into one feature matrix, as a CountVectorizer would produce.

The idea is that I want to create the ngrams separately for the two variables and use them as features in a ML algo. However, I do not want the extra ngrams that would be created by concatenating the strings together, like world good in hello world good morning. This is why I keep the ngram creation separated.

The issue is that by doing so, the resulting (sparse) matrix will contain a duplicated hello column. See here:
vector = CountVectorizer(ngram_range=(1, 2))
v1 = vector.fit_transform(data.text1.values)
print(vector.get_feature_names())
['hello', 'hello universe', 'hello world', 'universe', 'world']
v2 = vector.fit_transform(data.text2.values)
print(vector.get_feature_names())
['good', 'good morning', 'hello', 'hello two', 'morning', 'three', 'two', 'two three']
And now concatenating v1 and v2 gives 13 columns:
from scipy.sparse import hstack
print(hstack((v1, v2)).toarray())
[[1 0 1 0 1 1 1 0 0 1 0 0 0]
[1 1 0 1 0 0 0 1 1 0 1 1 1]]
The proper text features should be 12: hello, world, hello world, good, morning, good morning, hello universe, universe, two, three, hello two, two three.
What can I do here to have the proper unique words as features? Thanks!
Upvotes: 2
Views: 404
Reputation: 5304
I think that the best way to tackle this problem would be to create a custom Transformer that uses a CountVectorizer. I would do it as follows:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

class MultiRowsCountVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.vectorizer = CountVectorizer(ngram_range=(1, 2))

    def fit(self, X, y=None):
        # flatten all textual columns into one array of documents
        X_ = np.reshape(X.values, (-1,))
        self.vectorizer.fit(X_)
        return self

    def transform(self, X, y=None):
        # join all the textual columns of each row into one line of text
        X_ = X.apply(' '.join, axis=1)
        return self.vectorizer.transform(X_)

    def get_feature_names(self):
        return self.vectorizer.get_feature_names()

transformer = MultiRowsCountVectorizer()
X_ = transformer.fit_transform(data)
transformer.get_feature_names()
The fit() method fits the CountVectorizer by treating the columns independently, while transform() treats the columns as one line of text. np.reshape(X.values, (-1,)) turns a matrix of shape (N, n_columns) into a one-dimensional array of size (N*n_columns,). This ensures that each text field is treated independently during fit(). After that, the transformation is applied to all the text fields of a sample by joining them together; any cross-column ngram produced by the join is simply absent from the fitted vocabulary and is ignored.
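As a quick check of that flattening step, here is what np.reshape does to a 2x2 array of strings (toy data matching the example above):

```python
import numpy as np

X = np.array([['hello world', 'good morning'],
              ['hello universe', 'hello two three']])

# row-major flattening: shape (2, 2) -> (4,), each text field becomes its own document
flat = np.reshape(X, (-1,))
print(flat.shape)  # (4,)
print(list(flat))  # ['hello world', 'good morning', 'hello universe', 'hello two three']
```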
This custom Transformer returns the desired 12 features:
['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']
and the following feature matrix:
[[1 1 1 0 0 1 1 0 0 0 0 1]
[0 0 2 1 1 0 0 1 1 1 1 0]]
NOTES: this custom transformer assumes that X is a pd.DataFrame with n textual columns.

EDIT: The text fields need to be joined with a space during transform().
Upvotes: 2
Reputation: 9247
Disclaimer: this answer might not be very sophisticated, but if I understood your problem correctly, it should do the job.
# create an additional column by chaining the two text columns with a fake word
data['text3'] = data['text1'] + ' xxxxxxxxxx ' + data['text2']
print(data)
# text1 text2 text3
# 0 hello world good morning hello world xxxxxxxxxx good morning
# 1 hello universe hello two three hello universe xxxxxxxxxx hello two three
# instantiate CountVectorizer and fit it
vector = CountVectorizer(ngram_range=(1, 2))
v3 = vector.fit_transform(data.text3.values)
# have a look at the resulting column names
all_colnames = vector.get_feature_names()
print(all_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'universe xxxxxxxxxx', 'world', 'world xxxxxxxxxx', 'xxxxxxxxxx', 'xxxxxxxxxx good', 'xxxxxxxxxx hello']
# select only column names of interest
correct_colnames = [e for e in vector.get_feature_names() if 'xxxxxxxxxx' not in e]
print(correct_colnames)
# ['good', 'good morning', 'hello', 'hello two', 'hello universe', 'hello world', 'morning', 'three', 'two', 'two three', 'universe', 'world']
print(len(all_colnames))
# 17
print(len(correct_colnames))
# 12 # the desired length
# select only the array columns where the fake word is absent
arr = v3.toarray()[:, ['xxxxxxxxxx' not in e for e in all_colnames]]
print(arr.shape)
print(arr)
# (2, 12)
# [[1 1 1 0 0 1 1 0 0 0 0 1]
# [0 0 2 1 1 0 0 1 1 1 1 0]]
# if you need a pandas.DataFrame as result
new_df = pd.DataFrame(arr, columns=correct_colnames)
print(new_df)
# good good morning hello hello two hello universe hello world morning three two two three universe world
# 0 1 1 1 0 0 1 1 0 0 0 0 1
# 1 0 0 2 1 1 0 0 1 1 1 1 0
The rationale behind it is: we insert a fake word, like 'xxxxxxxxxx', which is close to impossible to encounter in a real text string. The algorithm will consider it a real word and will therefore create 1-grams and 2-grams with it. However, we can eliminate those n-grams afterwards, and equal words (like 'hello' in this case) will not be counted separately for the two text columns; in fact, you can see that in the resulting dataframe the word 'hello' is counted twice in the second row rather than appearing as two duplicated columns.
Upvotes: 2