Reputation: 15
I have a column in a dataframe with column 'url_product' that contains a list of dictionaries as below (showing first 4 rows as an example). Each dictionary contains url and product associated with that url.
df.url_product[0]
[{'url': 'https://www.abcstore.com/product/11-abc-gift-card/',
'product': 'giftcard, abcstore'},
{'url': 'https://www.abcstore.com/product/10-skin-lotion/',
'product': 'lotion'},
{'url': 'https://www.abcstore.com/product/10414-moisturising-cream',
'product': 'cream'},
{'url': 'https://www.abcstore.com/blog/best-skincare-lotions/',
'product': 'lotion'},
{'url': 'https://www.abcstore.com/article/140-best-anti-aging-serum',
'product': 'serum'}]
df.url_product[1]
[{'url': 'https://www.abcstore.com/product/7-night-cream',
'product': 'nightcream'},
{'url': 'http://www.abcstore.com/product/149-smoothing-serum/',
'product': 'serum'},
{'url': 'https://www.abcstore.com/blog/rapid-reveal-face-peel',
'product': 'facepeel'}]
df.url_product[2]
[{'url': 'https://www.abcstore.com/product/25-night-infusion-cream',
'product': 'infusioncream'},
{'url': 'https://www.abcstore.com/product/144-bio-cellulose-mask',
'product': 'cellulosemask, mask'},
{'url': 'https://www.abcstore.com/',
'product': 'bestseller, homepage'},
{'url': 'https://www.abcstore.com/blog/essential-skincare-products/',
'product': 'essential, blog'}]
df.url_product[3]
[{'url': 'https://www.abcstore.com/blog/top-skincare-products-2020',
'product': 'skincare, 2020'},
{'url': 'http://www.abcstore.com/article/smoothing-serum/',
'product': 'serum'}]
For each of these rows, I am looking to do the following
{'url': 'https://www.abcstore.com/product/11-abc-gift-card/',
'product': 'giftcard, abcstore'}
{'url': 'https://www.abcstore.com/product/11-abc-gift-card/',
'product': 'giftcard, abcstore'}
Expected outcome for the first 3 three rows after the steps above
df.top_product_id
[0] '11'
[1] (blank)
[2] '114'
[3] (blank)
a few points to explain the expected outcome
Row[0]- expect 11 as product_id 11 has a count of 2 (giftcard, abcstore) while product_id 10 and 10414 only has 1 each. both the blog and article urls will be skipped as they do not contain '/product/' in the url
Row[1]- expect the outcome to be blank as the two product URLs are attached to 1 product each and since there is no single url with the highest count, row would be blank
Row[2]- expect 114 as product_id 114 has highest count of 2 (cellulosemask, mask)
Row[3]- expect the outcome to be blank as there are no product URLs
How would I create the new column ('top_product_id') in the dataframe with the expected outcome?
Upvotes: 1
Views: 47
Reputation: 71687
Here is one possible approach:
def findID(data):
df1 = pd.DataFrame(data)
df1 = df1.assign(
count=df1['product'].str.split(', ').str.len(),
product_id=df1['url'].str.extract(r'.*/product/(\d+)', expand=False)
).dropna().drop_duplicates(subset=['count'], keep=False)
if df1.empty:
return '(Blank)'
return df1.loc[df1['count'].idxmax(), 'product_id']
df['top_product_id'] = df['url_product'].apply(findID)
# print(df)
url_product top_product_id
0 [{'url': 'https://www.abcstore.com/product/11-... 11
1 [{'url': 'https://www.abcstore.com/product/7-n... (Blank)
2 [{'url': 'https://www.abcstore.com/product/25-... 144
3 [{'url': 'https://www.abcstore.com/blog/top-sk... (Blank)
Upvotes: 1