How to extract the first instance of a nested class using BeautifulSoup

Question

There are multiple elements that all share the class "row". Within each row element, there are multiple elements that all share the class "column".

I am trying to iterate through the row class, gathering only the first column of each row.

I then print the contents of the link in that column.

What is the correct way to do this? I have tried making a list, but after creating the list, I am no longer able to use the BeautifulSoup functions on the object.

This is the URL:

https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils

rows = soup.find_all('div', attrs={'class': 'row'})

for row in rows:
    # find() returns only the first matching column in this row.
    col = row.find('div', attrs={'class': 'column'})
    link = col.find('a')
    print(link.contents)

Dan-Dev · Accepted Answer

It looks like you need a cookie set before you can see the content on the subcategory page. So, if I understand the question right:

import requests
from bs4 import BeautifulSoup

# You need to store cookies, so use a session.
s = requests.Session()
# Request the category page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content, 'html.parser')
# Get the product divs.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Print the text of the first link in each div.
for div in divs:
    print(div.find('a').text)

Outputs:

Balsam Fir 15 ml
Balsam Fir 30 ml
Balsam Fir 5 ml
Basil Essential Oil  15ml
Basil Essential Oil  30ml
Basil Essential Oil  3ml
Basil Essential Oil  5ml
Bergamot Essential Oil  15ml
...
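
For the more general case described in the question (the first column of every row), here is a minimal offline sketch. The HTML snippet and its class names are made up for illustration and are not taken from the site. The point it tries to show is that find_all returns a list-like ResultSet, so BeautifulSoup methods such as find have to be called on each element rather than on the list itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; this is not the real page.
html = """
<div class="row">
  <div class="column"><a href="/a1">A1</a></div>
  <div class="column"><a href="/a2">A2</a></div>
</div>
<div class="row">
  <div class="column"><a href="/b1">B1</a></div>
  <div class="column"><a href="/b2">B2</a></div>
</div>
<div class="row"></div>
"""

soup = BeautifulSoup(html, "html.parser")

first_links = []
for row in soup.find_all("div", class_="row"):
    # find() returns the first matching column, or None if the row has none.
    col = row.find("div", class_="column")
    if col is None:
        continue
    link = col.find("a")
    if link is not None:
        first_links.append(link.get_text(strip=True))

print(first_links)
```

The None checks are there because find returns None when nothing matches, which would otherwise raise an AttributeError on the next call.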

If you just want unique names, strip the size off with a regex and add each name to a set:

import re

import requests
from bs4 import BeautifulSoup

# You need to store cookies, so use a session.
s = requests.Session()
# Request the category page first to get the cookie.
s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories")
# Make the real request.
page = s.get("https://www.theherbarium.com/products/?category=Essential%20Oils%20And%20Accessories&subcategory=Superior%20Quality%20Essential%20Oils")
soup = BeautifulSoup(page.content, 'html.parser')
# Get the product divs.
divs = soup.find_all('div', attrs={'class': 'col-sm-4 column-spacer'})
# Collect the link text with the trailing size (e.g. " 15 ml") removed.
a = set()
for div in divs:
    text = div.find('a').text
    # Raw string so the backslashes reach the regex engine unchanged.
    a.add(re.sub(r'\s*\d+\s*ml$', '', text))
print(a)

Outputs:

    {'Lavender, Bulgarian Essential Oil', 'Thyme, White', 'Mandarin, Red Essential Oil', 'Pine Needle Essential Oil', 'Lemongrass Essential Oil', 'Fir Needle, Siberian', 'Spruce', 'Peppermint', 'Lime Essential Oil', 'Myrrh', 'Juniper Essential Oil', 'Petitgrain', 'Wintergreen', 'Lemon Essential Oil', 'Palmarosa', 'Balsam Fir', 'Chamomile, Roman', 'Cypress', 'Citronella', 'Rosemary', 'Lemon myrtle Essential Oil', 'Clary Sage', 'Cinnamon Bark', 'Frankincense', 'Tangerine', 'Cocoa, Absolute', 'Spearmint', 'Ravensara Essential Oil', 'Spike Lavender Essential Oil', 'Hyssop', 'Ylang Ylang', 'Basil Essential Oil', 'Bergamot Essential Oil', 'Fir Needle, Siberian1', 'Geranium Bourbon', 'Patchouli', 'Black Pepper Essential Oil', 'Fennel', 'Grapefruit Essential Oil', 'Eucalyptus', 'Carrot Seed Essential Oil', 'Chamomile, German', 'Vetiver', 'Tea Tree', 'Ginger', 'Marjoram, Sweet', 'Clove Bud'}
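
To see what the regex does without hitting the site, here is a small sketch that runs it on a few of the product names from the first output listing. The variable names are just illustrative, and the result is sorted only to make the printed order predictable, since a set has no fixed order.

```python
import re

# Sample product names copied from the first output listing.
names = [
    "Balsam Fir 15 ml",
    "Balsam Fir 30 ml",
    "Balsam Fir 5 ml",
    "Basil Essential Oil  15ml",
    "Basil Essential Oil  3ml",
    "Bergamot Essential Oil  15ml",
]

# Optional whitespace, digits, optional whitespace, then "ml" at the end.
size_suffix = re.compile(r"\s*\d+\s*ml$")

# A set comprehension drops duplicates; sorted() gives a stable order.
unique_names = sorted({size_suffix.sub("", name) for name in names})
print(unique_names)
# ['Balsam Fir', 'Basil Essential Oil', 'Bergamot Essential Oil']
```

Names that do not end in a size, or that end in a different unit, pass through unchanged.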
