Rashmi Singh

Reputation: 539

Scikit Learn - Extract word tokens from a string delimiter using CountVectorizer

I have a list of strings. If a string contains the '#' character, I want to extract the part before the '#' and get the frequency count of word tokens from that part only; e.g. for the string "first question # on stackoverflow" the expected tokens are "first" and "question".

If the string does not contain '#', then return the tokens of the whole string.

To compute the term-document matrix I am using CountVectorizer from scikit-learn.

My code is below:

from sklearn.feature_extraction.text import CountVectorizer

class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s
        else:
            return s.split('#')[0]

def FindKmeans():
    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
    pos_vector = vec.fit_transform(text).toarray()
    print(vec.get_feature_names())

output : [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']

Expected Output : [u'first', u'ques', u'please', u'help']

Upvotes: 5

Views: 5838

Answers (3)

redratear

Reputation: 249

 s.split('#', 1)[0] 

is your result. You don't need to check whether '#' exists or not: if it doesn't, split returns the whole string as the first element.
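A quick check with the question's sample strings shows why: split with maxsplit=1 returns the text before the first '#', or the whole string when no '#' is present.

```python
# str.split with maxsplit=1 splits at most once, so the first element
# is everything before the first '#'; with no '#', it is the whole string.
print("first ques # on stackoverflow".split('#', 1)[0])  # 'first ques '
print("please help".split('#', 1)[0])                    # 'please help'
```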

Upvotes: 0

Nickil Maveli

Reputation: 29711

You could split on your separator ('#') at most once and take the first part of the split.

from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    return [text.split('#', 1)[0].strip()]

text = ["first ques # on stackoverflow", "please help"]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(text).toarray()
vocab = vec.get_feature_names()

required_list = []
for word in vocab:
    required_list.extend(word.split())
print(required_list)

#['first', 'ques', 'please', 'help']

Upvotes: 6

piman314

Reputation: 5355

The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you don't, but you haven't split it into words. Try the tokenizer below.

class MyTokenizer(object):
    def __call__(self,s):
        if(s.find('#')==-1):
            return s.split(' ')
        else:
            return s.split('#')[0].split(' ')
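A quick check of this tokenizer on the question's sample strings. One caveat: `split(' ')` leaves an empty-string token after the trailing space in "first ques ", which CountVectorizer would then count as a feature. The variant below (a sketch, not the answer's exact code) uses `split()` with no argument to drop that empty token as well:

```python
class MyTokenizer(object):
    def __call__(self, s):
        # Take the part before the first '#' (the whole string if there
        # is none), then split on whitespace. split() with no argument
        # also discards the empty token that split(' ') would leave
        # after the trailing space in "first ques ".
        return s.split('#', 1)[0].split()

tok = MyTokenizer()
print(tok("first ques # on stackoverflow"))  # ['first', 'ques']
print(tok("please help"))                    # ['please', 'help']
```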

Upvotes: 2
