beubeu

Reputation: 95

Pandas dataframe to doc2vec.LabeledSentence

I have this dataframe :

    order_id    product_id  user_id          
    2           33120       u202279  
    2           28985       u202279  
    2           9327        u202279  
    4           39758       u178520  
    4           21351       u178520  
    5           6348        u156122  
    5           40878       u156122  

Type of user_id: string
Type of product_id: integer

I would like to use this dataframe to create a Doc2vec corpus. So I need to use LabeledSentence to build, for each user, an object of this form:
{tags: user_id, words: all product ids ordered by that user_id}

But the dataframe's shape is (32434489, 3), so I should avoid using a loop to create my LabeledSentence objects.

I tried running the function below with multiprocessing, but it takes too long.

Do you have any idea how to transform my dataframe into the right format for a Doc2vec corpus, where the tag is the user_id and the words are the list of products for that user_id?

def append_to_sequences(i):
     user_id = liste_user_id.pop(0)
     prd_user_list = data.loc[data["user_id"] == user_id, "product_id"].astype(str).tolist()
     return doc2vec.LabeledSentence(words=prd_user_list, tags=user_id)

pool = multiprocessing.Pool(processes=3)
result = pool.map_async(append_to_sequences, np.arange(len_liste_unique_user))
pool.close()
pool.join()
sentences = result.get()

Upvotes: 1

Views: 2278

Answers (1)

gojomo

Reputation: 54173

Using multiprocessing is likely overkill. The forking of processes can wind up duplicating all existing memory, and involve excess communication marshalling results back into the master process.

Using a loop should be OK. 32 million rows (and far fewer unique user_ids) isn't that much, depending on your RAM.

Note that in recent versions of gensim TaggedDocument is the preferred class for Doc2Vec examples.

If we were to assume you have a list of all unique user_ids in liste_user_id, and a (new, not shown) function that gets the list-of-words for a user_id called words_for_user(), creating the documents for Doc2Vec in memory could be as simple as:

documents = [TaggedDocument(words=words_for_user(uid), tags=[uid])
             for uid in liste_user_id]
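
One possible way to get the per-user word lists without filtering the whole dataframe once per user is a single pandas groupby pass. This is only a sketch under assumptions: the sample data mirrors the question's columns, words_by_user is a hypothetical name, "ordered" is taken to mean row order, and the namedtuple fallback exists only so the snippet runs where gensim is not installed.

```python
from collections import namedtuple

import pandas as pd

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same two fields, used only if gensim is unavailable.
    TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

# Small sample with the same columns as the question's dataframe.
data = pd.DataFrame({
    "order_id": [2, 2, 2, 4, 4, 5, 5],
    "product_id": [33120, 28985, 9327, 39758, 21351, 6348, 40878],
    "user_id": ["u202279"] * 3 + ["u178520"] * 2 + ["u156122"] * 2,
})

# One pass over the frame: collect each user's product ids (as strings),
# keeping the row order within each user. words_by_user is a hypothetical name.
words_by_user = (
    data["product_id"].astype(str)
    .groupby(data["user_id"], sort=False)
    .agg(list)
)

# Build one document per user; tags is a one-element list.
documents = [TaggedDocument(words=words, tags=[uid])
             for uid, words in words_by_user.items()]
```

Here the groupby plays the role of words_for_user() for all users at once, so the list comprehension only iterates over unique user_ids.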

Note that tags should be a list of tags, not a single tag – even though in many common cases each document only has a single tag. (If you provide a single string tag, it will see tags as a list-of-characters, which is not what you want.)
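
To illustrate why a bare string goes wrong, here is a small pure-Python sketch. It assumes only that the tags value is consumed by iterating over it, as the note above describes; the variable names are made up for illustration.

```python
user_id = "u202279"

# Iterating over a bare string yields its characters,
# so each character would be treated as a separate tag.
tags_from_string = [tag for tag in user_id]

# Wrapping the string in a list yields the single intended tag.
tags_from_list = [tag for tag in [user_id]]

print(tags_from_string)
print(tags_from_list)
```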

Upvotes: 2
