How to remove a specific unigram from a text corpus while keeping the bigrams that contain that word?

Question

I have a situation where I need to remove the unigrams of specific words from a text corpus while keeping the bigrams (and trigrams) that contain those words, along with the unigrams of all other words.

I am trying to pass text address data (a column in an Excel sheet) along with some other numerical features to a classification algorithm. I need to count-vectorize the text data, filter out specific unigrams, and attach the result back to the dataframe so that the classifier can use it.

**Sample data in the text column**

TAJ MAHAL
TAJ MALABAR KOCHI
TAJ MALABAR KOCHI
TAJ  RESIDENCY  TVM
LEELA PALACE  
PALACE  ROAD
HILL VIEW ROAD
HILL  AVENUE
HILL STATION

For TAJ and HILL, I want only bigrams and trigrams; for all other words I want unigrams, bigrams, and trigrams.

**Output: bigrams and unigrams**

TAJ MAHAL
TAJ MALABAR 
MALABAR KOCHI
TAJ  RESIDENCY 
KOCHI
LEELA 
PALACE  
LEELA PALACE  
PALACE  ROAD
HILL VIEW
HILL  AVENUE
HILL STATION

When I pass Taj and Hill as stop words, the bigrams and trigrams containing them are not generated either.

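    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
    cv_txt = cv.fit_transform(data.pop('Txt'))
    # Attach each n-gram count to the dataframe as a sparse column
    for i, col in enumerate(cv.get_feature_names_out()):
        data[col] = pd.arrays.SparseArray(cv_txt[:, i].toarray().ravel(), fill_value=0)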

After filtering out the specific unigrams, I want to attach the remaining features back to the dataframe so that I can run a classification algorithm. The final output should be a sparse matrix of the count-vectorized text data.

Answers (1)