Reputation: 1530
I have a pandas df with the column url. The data looks like this:
row url
1 'https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese'
2 'https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/'
3 'https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/'
I only need the first path segment after the domain, e.g. cooking or holiday-recipes.
Desired output:
row url
1 cooking
2 holiday-recipes
3 kitchen-tools
I wanted to parse urls into different columns and then drop the columns that I don't need. Here is the code:
df['protocol'],df['domain'],df['path']=zip(*df['url'].map(urlparse(df['url']).urlsplit))
The error message is: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is there a better way to solve the issue? How can I grab the specific index?
Upvotes: 0
Views: 371
Reputation: 26676
Another way is to match the run of lowercase letters and hyphens immediately after com/:
df['url'] = df['url'].str.extract(r'(?<=com/)([a-z-]+)', expand=False)
url
0 cooking
1 holiday-recipes
2 kitchen-tools
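For anyone who wants to try the regex approach end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the URLs in the question, and the column name section is my own choice so the original url column is kept; neither comes from the asker's real data. Note that rows where the pattern does not match come back as NaN.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed; the real df may differ)
df = pd.DataFrame({
    "url": [
        "https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese",
        "https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/",
        "https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/",
    ]
})

# Look behind for "com/" and capture the lowercase letters and hyphens that follow.
# expand=False returns a Series (not a one-column DataFrame) for a single group.
# "section" is a hypothetical column name used here to avoid overwriting "url".
df["section"] = df["url"].str.extract(r"(?<=com/)([a-z-]+)", expand=False)

print(df["section"].tolist())
```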
Upvotes: 1
Reputation: 7594
Is this what you're looking for?
df['url'] = df['url'].str.split('/').str[3]
print(df)
row url
0 1 cooking
1 2 holiday-recipes
2 3 kitchen-tools
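If you would rather stay close to the original urlparse idea, a sketch along these lines may work. As far as I can tell, the ValueError in the question comes from passing the whole Series to urlparse, which expects a single string, so the parsing has to be applied per row instead. The helper first_path_segment and the column name section are hypothetical names for illustration, and the sample frame is rebuilt from the URLs in the question.

```python
from urllib.parse import urlsplit

import pandas as pd


def first_path_segment(url):
    """Return the first non-empty path segment of a URL, or None if there is none.

    Hypothetical helper for illustration; not part of pandas or urllib.
    """
    path = urlsplit(url).path  # e.g. "/cooking/recipe-ideas/recipes/four-cheese"
    segments = [part for part in path.split("/") if part]
    return segments[0] if segments else None


# Sample data rebuilt from the question (assumed; the real df may differ)
df = pd.DataFrame({
    "url": [
        "https://www.delish.com/cooking/recipe-ideas/recipes/four-cheese",
        "https://www.delish.com/holiday-recipes/thanksgiving/thanksgiving-cabbage/",
        "https://www.delish.com/kitchen-tools/cookware-reviews/advice/kitchen-tools-gadgets/",
    ]
})

# Apply the parser to each URL string individually rather than to the Series itself
df["section"] = df["url"].map(first_path_segment)

print(df["section"].tolist())
```

Unlike splitting on '/' and taking a fixed position, this does not depend on the URL having a scheme prefix or on the domain ending in com, at the cost of a Python-level call per row.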
Upvotes: 1