Reputation: 539
I have a list of strings. If a string contains the '#' character, I want to extract the first part of the string and get the frequency count of word tokens from that part only, i.e. if the string is "first question # on stackoverflow", the expected tokens are "first", "question".
If the string does not contain '#', return the tokens of the whole string.
To compute the term-document matrix I am using CountVectorizer from scikit-learn.
My code is below:
from sklearn.feature_extraction.text import CountVectorizer

class MyTokenizer(object):
    def __call__(self, s):
        if s.find('#') == -1:
            return s
        else:
            return s.split('#')[0]

def FindKmeans():
    text = ["first ques # on stackoverflow", "please help"]
    vec = CountVectorizer(tokenizer=MyTokenizer(), analyzer='word')
    pos_vector = vec.fit_transform(text).toarray()
    print(vec.get_feature_names())
Output: [u' ', u'a', u'e', u'f', u'h', u'i', u'l', u'p', u'q', u'r', u's', u't', u'u']
Expected Output : [u'first', u'ques', u'please', u'help']
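For context on why the output is single characters: when the tokenizer returns a plain string instead of a list of tokens, code that iterates over the "tokens" walks the string character by character. A minimal pure-Python illustration of this (no scikit-learn needed):

```python
s = "first ques "  # what MyTokenizer returns for the first document

# Iterating a string yields its characters, so each character is
# treated as a "token" and ends up in the vocabulary.
print(sorted(set(s)))  # [' ', 'e', 'f', 'i', 'q', 'r', 's', 't', 'u']
```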
Upvotes: 5
Views: 5838
Reputation: 249
s.split('#', 1)[0]

is your result. You don't need to check whether "#" exists or not.
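For example, `str.split` with `maxsplit=1` already handles both cases in one expression:

```python
def first_part(s):
    # When the separator is absent, split returns the whole string
    # as the only element, so no explicit '#' check is needed.
    return s.split('#', 1)[0]

print(first_part("first ques # on stackoverflow"))  # 'first ques '
print(first_part("please help"))                    # 'please help'
```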
Upvotes: 0
Reputation: 29711
You could split on your separator (#) at most once and take the first part of the split.
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    return [text.split('#', 1)[0].strip()]

text = ["first ques # on stackoverflow", "please help"]
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(text).toarray()
vocab = vec.get_feature_names()

required_list = []
for word in vocab:
    required_list.extend(word.split())

print(required_list)
# ['first', 'ques', 'please', 'help']
Upvotes: 6
Reputation: 5355
The problem lies with your tokenizer: you've split the string into the part you want to keep and the part you don't, but you haven't split the string into words. Try the tokenizer below.
class MyTokenizer(object):
    def __call__(self, s):
        # split() with no argument also discards empty strings, e.g. the
        # one that split(' ') would leave from the trailing space before '#'
        if s.find('#') == -1:
            return s.split()
        else:
            return s.split('#')[0].split()
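Once the tokenizer returns a list of words, a quick sanity check (a pure-Python sketch, assuming whitespace tokenization; no scikit-learn needed) shows the token lists the vectorizer will receive:

```python
def word_tokens(s):
    # Hypothetical helper mirroring the tokenizer above: take the part
    # before '#' (or the whole string) and split it on whitespace.
    return s.split('#', 1)[0].split()

print(word_tokens("first ques # on stackoverflow"))  # ['first', 'ques']
print(word_tokens("please help"))                    # ['please', 'help']
```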
Upvotes: 2