Reputation: 541
This seems like a simple data manipulation operation, but I am stuck on it.
I have a recommendation dataset for a campaign.
Masteruserid  content
1             100
1             101
1             102
2             100
2             101
2             110
Now, for each user I want to recommend at least 5 content items. Since, for instance, Masteruserid 1 has only three recommendations, I want to pick the remaining two randomly from globally viewed content, which is a separate dataset (list). I also have to check for duplicates, in case a randomly picked item is already present in the raw dataset.
global_content
100
300
301
101
In reality I have around 4,000+ Masteruserids. I just want assistance on how to start approaching this.
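For illustration, the two inputs above could be loaded as pandas DataFrames like this (a minimal sketch; the names df and df2 are just placeholders, and the answers below refer to them as df and df2/gc):
import pandas as pd
import numpy as np

# per-user campaign recommendations (sample above)
df = pd.DataFrame({'Masteruserid': [1, 1, 1, 2, 2, 2],
                   'content': [100, 101, 102, 100, 101, 110]})

# globally viewed content, used to top users up to 5 recommendations
df2 = pd.DataFrame({'global_content': [100, 300, 301, 101]})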
Upvotes: 2
Views: 155
Reputation: 25649
Try this, using the following as the global recommendations list:
df2['global_content']
0 100
1 300
2 301
3 101
4 400
5 500
6 401
7 501
recs = pd.DataFrame()
recs['content'] = df.groupby('Masteruserid')['content'].apply(
    lambda x: list(x) + np.random.choice(
        df2[~df2.isin(list(x))].dropna().values.flatten(), 2, replace=False).tolist())
recs
content
Masteruserid
1 [100, 101, 102, 300.0, 301.0]
2 [100, 101, 110, 501.0, 301.0]
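The 2 passed to np.random.choice works here because every user in the sample already has exactly three items; a sketch that instead tops each user up to at least k = 5 (the target from the question, not part of this answer) could look like this:
k = 5
pool = df2['global_content']
# pad each user's list with however many random, not-yet-seen global items are needed
topped_up = df.groupby('Masteruserid')['content'].apply(
    lambda x: list(x) + np.random.choice(
        pool[~pool.isin(x)].values, max(0, k - len(x)), replace=False).tolist())
# assumes the global list always has enough unseen items to reach k
Boolean indexing on the Series also keeps the picked values as integers rather than the floats seen in the output above.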
Upvotes: 0
Reputation: 294358
This pads each Masteruserid group up to k rows, drawing the missing items at random from the global content while excluding anything the user already has:
def add_content(df, gc, k=5):
    # df: one user's rows, gc: global content, k: target number of recommendations
    n = len(df)
    gcs = set(gc.squeeze())
    if n < k:
        # candidates = global content not already recommended to this user
        choices = list(gcs.difference(df.content))
        mc = np.random.choice(choices, k - n, replace=False)
        ids = np.repeat(df.Masteruserid.iloc[-1], k - n)
        data = dict(Masteruserid=ids, content=mc)
        return pd.concat([df, pd.DataFrame(data)], ignore_index=True)
    return df  # already has at least k recommendations

gb = df.groupby('Masteruserid', group_keys=False)
gb.apply(add_content, gc).reset_index(drop=True)  # gc: the global content, e.g. df2 above
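To sanity-check the result, you could verify that every user ends up with at least k rows (a small usage sketch, assuming the objects defined above):
out = gb.apply(add_content, gc).reset_index(drop=True)
print(out.groupby('Masteruserid').size().min())  # should be >= 5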
Upvotes: 1