pankaj

Reputation: 460

How to remove a specific unigram from a text corpus while still keeping the bigrams of that word?

I need to remove the unigrams of specific words from a text corpus while keeping the bigrams and trigrams that contain those words, along with the unigrams of all other words.

I am trying to pass text address data (a column in an Excel file) along with some other numerical features to a classification algorithm. I need to count-vectorize the text data, filter out specific unigrams, and attach the result back to the dataframe so that the classifier can use it.

**Sample data in the text column**

TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ  RESIDENCY  TVM
LEELA PALACE  
PALACE  ROAD
HILL VIEW ROAD
HILL  AVENUE
HILL STATION

For TAJ and HILL I want only bigrams and trigrams; for all other words I want unigrams, bigrams and trigrams.

**Expected output (unigrams and bigrams)**

TAJ MAHAL
TAJ MALABAR 
MALABAR KOCHI
TAJ  RESIDENCY 
KOCHI
LEELA 
PALACE  
LEELA PALACE  
PALACE  ROAD
HILL VIEW
HILL  AVENUE
HILL STATION

When I pass Taj and Hill as stop words, the bigrams and trigrams that contain them are not generated either.

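    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
    cv_txt = cv.fit_transform(data.pop('Txt'))
    # add one sparse column per n-gram back onto the dataframe
    for i, col in enumerate(cv.get_feature_names()):
        data[col] = pd.SparseSeries(cv_txt[:, i].toarray().ravel(), fill_value=0)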

After filtering out the specific unigrams, I want to attach the result back to the dataframe so that I can run a classification algorithm. The final output is a sparse matrix of the count-vectorized text data.

Upvotes: 0

Views: 965

Answers (1)

piman314

Reputation: 5355

If you just want to remove the specific unigrams, you will have to remove them from the transformed data using a mask. If this is going to be used in anything more complicated than a one-off analysis, I would suggest writing a wrapper class to manage it; otherwise it will become difficult to keep track.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

X = """TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ  RESIDENCY  TVM
LEELA PALACE  
PALACE  ROAD
HILL VIEW ROAD
HILL  AVENUE
HILL STATION"""
X = X.split('\n')
df = pd.DataFrame(dict(txt=X))

cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv.fit(df.txt)
# scikit-learn >= 1.0; on older releases use cv.get_feature_names() instead
feat_name = cv.get_feature_names_out()

# List of unigrams to remove (will work for ngrams too)
remove_list = ['taj', 'hill']

# This is the mask of features you want to keep.
# Matching is on the exact feature string, so 'taj mahal' is not affected by 'taj'.
keep_mask = ~np.isin(feat_name, remove_list)

# before the mask
X_transformed = cv.transform(df.txt)
print(X_transformed.shape)

# after the mask
X_transformed = X_transformed[:, keep_mask]
print(X_transformed.shape)
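
The wrapper class suggested above could look roughly like the sketch below. The class name `MaskedCountVectorizer` and its attributes are made up for illustration; they are not part of scikit-learn. It reads the column order from `vocabulary_` so that it does not depend on which feature-name method your scikit-learn version provides.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer


class MaskedCountVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: a CountVectorizer that drops exact-match features."""

    def __init__(self, remove_list=(), ngram_range=(1, 3), max_features=200):
        self.remove_list = remove_list
        self.ngram_range = ngram_range
        self.max_features = max_features

    def fit(self, raw_documents, y=None):
        self.vectorizer_ = CountVectorizer(
            analyzer="word",
            ngram_range=self.ngram_range,
            max_features=self.max_features,
        )
        self.vectorizer_.fit(raw_documents)
        vocab = self.vectorizer_.vocabulary_
        # Order the n-grams by their column index in the transformed matrix.
        names = np.array(sorted(vocab, key=vocab.get), dtype=str)
        # Exact string match: "taj" is dropped, "taj mahal" is kept.
        self.keep_mask_ = ~np.isin(names, list(self.remove_list))
        self.feature_names_ = names[self.keep_mask_]
        return self

    def transform(self, raw_documents):
        # Vectorize, then keep only the unmasked columns.
        return self.vectorizer_.transform(raw_documents)[:, self.keep_mask_]


docs = [
    "TAJ MAHAL", "TAJ MALABAR KOCHI", "TAJ MALABAR KOCHI",
    "TAJ  RESIDENCY  TVM", "LEELA PALACE  ", "PALACE  ROAD",
    "HILL VIEW ROAD", "HILL  AVENUE", "HILL STATION",
]
mcv = MaskedCountVectorizer(remove_list=["taj", "hill"])
X_masked = mcv.fit_transform(docs)
print(X_masked.shape)
print(list(mcv.feature_names_))
```

Note that `max_features` is applied during `fit`, before the mask, so the masked matrix can end up with fewer than `max_features` columns.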

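EDIT in response to the updated question

# code to do the pandas merge
# pd.SparseDataFrame was removed in pandas 1.0;
# pd.DataFrame.sparse.from_spmatrix builds the same kind of sparse frame
feat_name = np.array(feat_name)[keep_mask]
df_2 = pd.DataFrame.sparse.from_spmatrix(X_transformed, columns=feat_name)
df_merge = df.merge(df_2, left_index=True, right_index=True)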

Output:

(9, 27)
(9, 25)
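
If you want to convince yourself that the mask only drops the standalone unigrams, a quick check along these lines may help. It is only a sketch on a shortened copy of the sample data; the variable names `dropped` and `kept_with_removed_word` are just for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["TAJ MAHAL", "TAJ MALABAR KOCHI", "HILL VIEW ROAD", "LEELA PALACE"]
remove_list = ["taj", "hill"]

cv = CountVectorizer(analyzer="word", ngram_range=(1, 3)).fit(docs)
# vocabulary_ maps each n-gram to its column index; sort by that index.
feat_name = np.array(sorted(cv.vocabulary_, key=cv.vocabulary_.get), dtype=str)
keep_mask = ~np.isin(feat_name, remove_list)

# Features removed by the mask.
dropped = feat_name[~keep_mask].tolist()
# Surviving features that still contain one of the removed words.
kept_with_removed_word = [
    f for f in feat_name[keep_mask] if set(f.split()) & set(remove_list)
]

print(dropped)
print(kept_with_removed_word)
```

Only the exact strings in `remove_list` should show up in `dropped`, while bigrams and trigrams such as `taj mahal` remain among the kept features.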

To get this in one neat dataframe, just a

Upvotes: 2
