Reputation: 83
I am pretty new to data science. I'm trying to solve an NLP clustering problem using LDA, and I've run into a problem using CountVectorizer from sklearn.
I've got a Data Frame:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'word': [['one', 'two', 'four'],
                            ['five', 'six', 'nine'],
                            ['eight', 'eleven', 'ten']]})
df2 = df.copy().assign(word=df.word.map(lambda y: " ".join(y)))
   id              word
0   1      one two four
1   2     five six nine
2   3  eight eleven ten
And I've got a piece of code from the web which works well for my problem:
from sklearn.feature_extraction.text import CountVectorizer

cvectorizer = CountVectorizer(min_df=4, max_features=10000, ngram_range=(1, 2))
cvz = cvectorizer.fit_transform(df2['word'])
All I want is to add some kind of weight factor to the values in the word column. It should work like this: the first element of each array in the word column should have a weight of len(array), with the weights decreasing by one from the beginning to the end of the array.
For example, for the row with id = 1 I want the following:
{one: 3, two: 2, four: 1}
where the int values are my weight parameters.
And after this I want those weighted values to be pushed into CountVectorizer.
I've read the documentation but I just can't work out how to solve my problem.
Upvotes: 0
Views: 893
Reputation: 4821
The essential function here is the split() method: with it, you can turn each space-joined string back into a list of words, and also derive the integers you want to assign to each word.
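For a single row, a quick sketch of the two pieces (the variable names here are just for illustration):
s = "one two four"
words = s.split(" ")                                # ['one', 'two', 'four']
weights = list(reversed(range(1, len(words) + 1)))  # [3, 2, 1]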
The final answer: here is a drop-in dictionary-making function and the apply() calls that use it:
def make_dict(list1, list2):
    d = {}
    for k, v in zip(list1, list2):
        d[k] = v
    return d
df2['word'].apply(lambda x : (x.split(" "), [i for i in reversed(range(1,len(x.split(" "))+1))])).apply(lambda y : make_dict(y[0],y[1]))
This will return a Series, with each element of the Series being the dictionary you requested for that particular row. An explanation of this expression follows.
Explanation: Start with an apply() whose lambda creates a tuple for each row. The first item of the tuple is the split list of strings that will become your dictionary keys. The second item is the list of integers that will become the dictionary values (just a reversed list produced by a call to range(), whose arguments come from the string split() method mentioned at the start of the answer):
In [1]: df2['word'].apply(lambda x : (x.split(" "), [i for i in reversed(range(1,len(x.split(" "))+1))]))
Out[1]:
0 ([one, two, four], [3, 2, 1])
1 ([five, six, nine], [3, 2, 1])
2 ([eight, eleven, ten], [3, 2, 1])
Next, define a function that takes two lists as arguments and stitches them together into a dictionary. (We know from the operation above that the two lists always have the same length, so there is no need to enforce a length check unless we're feeling paranoid.)
In [2]: def make_dict(list1,list2):
...: d = {}
...: for k,v in zip(list1,list2):
...: d[k] = v
...: return d
list1 becomes the keys and list2 becomes the values. (Note that repeated keys will overwrite earlier ones, e.g., if one of your rows is "one one one".)
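For instance, with a repeated key (a made-up input just to show the overwrite behaviour):
make_dict(["one", "one", "one"], [3, 2, 1])   # returns {'one': 1}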
Now all that remains is to combine the output of the first expression with the function defined above, which we can do with another apply():
In [3]: df2['word'].apply(lambda x : (x.split(" "), [i for i in reversed(range(1,len(x.split(" "))+1))])).apply(lambda y : make_dict(y[0],y[1]))
Out[3]:
0 {'one': 3, 'two': 2, 'four': 1}
1 {'five': 3, 'six': 2, 'nine': 1}
2 {'eight': 3, 'eleven': 2, 'ten': 1}
Name: word, dtype: object
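As a side note beyond the answer itself: CountVectorizer only consumes raw text, so if the goal is to turn these weighted dictionaries into a document-term matrix, sklearn's DictVectorizer is the usual tool for that. A minimal sketch, assuming the Series of dictionaries above is stored in a variable named weighted (a name introduced here purely for illustration):
from sklearn.feature_extraction import DictVectorizer

# reuse the expression from above to build the Series of weight dictionaries
weighted = df2['word'].apply(lambda x: (x.split(" "), [i for i in reversed(range(1, len(x.split(" ")) + 1))])).apply(lambda y: make_dict(y[0], y[1]))
dv = DictVectorizer()
X = dv.fit_transform(weighted)   # sparse matrix, one row per document, entries are the weights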
Upvotes: 1