Ruud

Reputation: 41

Python beautifulsoup scrape site

I am trying to learn Python by scraping a website's lunch menu with BeautifulSoup. I have made the request:

import requests
from bs4 import BeautifulSoup
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

And the response looks like this:

<div class="lunchRow">
    <div class="lunchRowDay"><h3>Monday</h3></div>
    <div class="lunchRowItem"><div class="lunchRowItemActual">Meatballs</div>
    <div class="lunchRowItemActual">Soup</div>
    </div>
</div>
<div class="lunchRow">
    <div class="lunchRowDay"><h3>Tuesday</h3></div>
    <div class="lunchRowItem"><div class="lunchRowItemActual">Chicken</div>
    <div class="lunchRowItemActual">Pork</div>
    <div class="lunchRowItemActual">Fish</div>
    </div>
</div>

What is the easiest way to get the lunchRowItemActual values for each day? I started by searching for the day and getting the next div, but after that I am lost, and I assume this is not the right way to solve it.

soup = soup.find(string="Monday").find_next('div').contents[0].text

Upvotes: 3

Views: 129

Answers (3)

Nalan PandiKumar

Reputation: 358

First, find all elements with the lunchRow class. Iterate through them, and in each row find the lunchRowDay element to get the day.

Then use find_all to get the lunchRowItemActual elements within that lunchRow.


# Find all the lunchRow divs
lunch_rows = soup.find_all('div', class_='lunchRow')

# Iterate through each lunchRow div
for row in lunch_rows:
    day = row.find('div', class_='lunchRowDay').text.strip()
    items = [item.text.strip() for item in row.find_all('div', class_='lunchRowItemActual')]
    print(f"{day}: {', '.join(items)}")


Monday: Meatballs, Soup
Tuesday: Chicken, Pork, Fish

Upvotes: 0

Alistair

Reputation: 599

soup.select is a great way to do things like this, since it takes a CSS selector.

Then use get_text to get the text.

A list comprehension applies get_text to every element in the list.

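rows = soup.select("div.lunchRow")
for row in rows:
    # The day and the items are siblings inside each lunchRow, so search from the row
    print(row.select_one("div.lunchRowDay").get_text(strip=True))
    items = [item.get_text(strip=True) for item in row.select("div.lunchRowItemActual")]
    print(items)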

Upvotes: 1

Max Kaha

Reputation: 922

First off, you should get all lunchRow divs by their class name and save them to a variable, like so:

rows = soup.find_all('div', attrs={'class': 'lunchRow'})

Then we can loop over them and get the individual days and items as follows. Here we get the first/only lunchRowDay item and then look for all lunchRowItemActual elements inside our current row:

for row in rows:
    # Each row holds exactly one lunchRowDay and one or more lunchRowItemActual divs
    print(row.find('div', attrs={'class': 'lunchRowDay'}).text)
    actuals = row.find_all('div', attrs={'class': 'lunchRowItemActual'})
    for actual in actuals:
        print(actual.text)

Output of this is:

Monday
Meatballs
Soup
Tuesday
Chicken
Pork
Fish

Instead of printing them, you will most likely want to store them in a dict, using the lunchRowDay text as the key and a list of the lunchRowItemActual values as the value, but that is up to you.
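
The dict idea could look roughly like the sketch below. It is only an illustration: the html string is the sample markup from the question (in practice you would pass r.text), and the variable name menu is my own choice, not something from the site or from BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; with a live request this would be r.text
html = """
<div class="lunchRow">
    <div class="lunchRowDay"><h3>Monday</h3></div>
    <div class="lunchRowItem"><div class="lunchRowItemActual">Meatballs</div>
    <div class="lunchRowItemActual">Soup</div>
    </div>
</div>
<div class="lunchRow">
    <div class="lunchRowDay"><h3>Tuesday</h3></div>
    <div class="lunchRowItem"><div class="lunchRowItemActual">Chicken</div>
    <div class="lunchRowItemActual">Pork</div>
    <div class="lunchRowItemActual">Fish</div>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "menu" is a hypothetical name: day text -> list of item texts
menu = {}
for row in soup.find_all("div", class_="lunchRow"):
    day = row.find("div", class_="lunchRowDay").get_text(strip=True)
    actuals = row.find_all("div", class_="lunchRowItemActual")
    menu[day] = [actual.get_text(strip=True) for actual in actuals]

print(menu)
```

After that, menu["Tuesday"] gives you the list of Tuesday's items directly, which is usually easier to work with than printed lines.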

Upvotes: 3
