Reputation: 460
I need to remove the unigrams of specific words from a text corpus while keeping the bigrams and trigrams that contain those words, along with the unigrams of all other words.
I am trying to pass text address data (a column in an Excel sheet), along with some other numerical features, to a classification algorithm. I need to count-vectorize the text data, filter out specific unigrams, and attach the result back to the dataframe so that the classifier can use it.
**Sample data in the text column**
TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ RESIDENCY TVM
LEELA PALACE
PALACE ROAD
HILL VIEW ROAD
HILL AVENUE
HILL STATION
For TAJ and HILL, I want only bigrams and trigrams; for all other words, I want unigrams, bigrams, and trigrams.
**Expected output (unigrams and bigrams)**
TAJ MAHAL
TAJ MALABAR
MALABAR KOCHI
TAJ RESIDENCY
KOCHI
LEELA
PALACE
LEELA PALACE
PALACE ROAD
HILL VIEW
HILL AVENUE
HILL STATION
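When I pass TAJ and HILL as stop words, the bigrams and trigrams containing them are not generated either.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv_txt = cv.fit_transform(data.pop('Txt'))
# get_feature_names() and pd.SparseSeries no longer exist in current releases
for i, col in enumerate(cv.get_feature_names_out()):
    data[col] = pd.arrays.SparseArray(cv_txt[:, i].toarray().ravel(), fill_value=0)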
After filtering out the specific unigrams, I want to attach the remaining features back to the dataframe so that I can run a classification algorithm. The final output should be a sparse matrix of the count-vectorized text data.
Upvotes: 0
Views: 965
Reputation: 5355
If you just want to remove specific unigrams, you will have to remove them from the transformed data using a mask. If this is going to be used in anything more complicated than a one-off analysis, I would suggest writing a wrapper class to manage it; otherwise it will become difficult to keep track of.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
X = """TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ RESIDENCY TVM
LEELA PALACE
PALACE ROAD
HILL VIEW ROAD
HILL AVENUE
HILL STATION"""
X = X.split('\n')
df = pd.DataFrame(dict(txt=X))
cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv.fit(df.txt)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feat_name = cv.get_feature_names_out()
# List of unigrams to remove (will work for n-grams too)
remove_list = ['taj', 'hill']
# Mask of the features you want to keep (exact matches only, so 'taj mahal' survives)
keep_mask = ~np.isin(feat_name, remove_list)
# before the mask
X_transformed = cv.transform(df.txt)
print(X_transformed.shape)
# after the mask
X_transformed = X_transformed[:, keep_mask]
print(X_transformed.shape)
EDIT to updated question
# code to do the pandas merge
feat_name = np.array(feat_name)[keep_mask]
# pd.SparseDataFrame was removed in pandas 1.0; from_spmatrix builds sparse columns instead
df_2 = pd.DataFrame.sparse.from_spmatrix(X_transformed, columns=feat_name)
df_merge = df.merge(df_2, left_index=True, right_index=True)
Output:
(9, 27)
(9, 25)
To get this in one neat dataframe, just a
Upvotes: 2