est

Reputation: 51

Beautiful Soup web scraping: How do I scrape this particular HTML structure?

I'm trying to scrape the shop names and their corresponding addresses from a web page that has this HTML structure:

<div class="post_content entry-content" itemprop="articleBody">
<p>...</p>
<p>...</p>
<h2>1. SHOP NAME</h2>
<p>...</p>
<p>...</p>
<p><strong>Address</strong>: Dhoby Ghaut 238889<br />
<strong>Prices: </strong>Starting from SGD3.50 <br />
<strong>Websites</strong>:<a href="https://..." target="_blank" rel="noopener"></a></p>

<h2>2. SHOP NAME</h2>
.
.
<h2>3. SHOP NAME</h2>
.
.
</div>

The page does not use classes for the individual shop entries, and I'm having trouble extracting the addresses. Does anyone know how to do this?

This is my code for getting the Shop Name:

url= requests.get('https://avenueone.sg/recipes-food/bubble-tea-brands-singapore/').text

shop= []
address= []

soup = BeautifulSoup(url,'lxml')

for row in soup.find_all("h2"): 
    shop.append(row.text)
    for line in row.find_all(string='Address'):
        address.append(line.text)
import re
for i in soup.find('div', class_='post_content entry-content'):
    for x in soup.find_all(re.compile("^Address")):
        address.append(line.text)

I'm able to retrieve the list of shop names into a dataframe, but not their corresponding addresses. Can anyone help me out with this?

Upvotes: 1

Views: 157

Answers (3)

QHarr

Reputation: 84465

You can use the following CSS selectors and a regex. The regex only checks whether the address entry points to a website link rather than giving a street address and, if so, the link's URL is retrieved instead. This requires bs4 4.7.1+ because I use :contains to target the Address strong tags.

from bs4 import BeautifulSoup as bs
import requests, re

r = requests.get('https://avenueone.sg/recipes-food/bubble-tea-brands-singapore/')
soup = bs(r.content, 'lxml')
names = [i.text.replace('\xa0',' ') for i in soup.select('.post_content p + h2')]
addresses = [i.next_sibling.replace('\xa0','').replace(':','').strip() if not re.search(r'See this|See their',i.next_sibling) else i.parent.a['href'] for i in soup.select('strong:contains("Address")') ]
results = dict(zip(names,addresses))
print(results)


Upvotes: 1

Jonathan Herrera

Reputation: 6184

Since there are no classes, I would skip BeautifulSoup and fall back to regular expressions to find the addresses in the response. If the formatting is stable and matches what you described in your question, we could use the following regex:

import re


# `url` is the response text from the question's code (requests.get(...).text)
address_pattern = r"<strong>Address</strong>:(.+?)<br />"
addresses = re.findall(address_pattern, url)

We still need to relate the addresses to the shop names, but how to do that depends on details you have not provided. If there is exactly one address per shop, and the shop names are stored in a variable shops, we can simply zip(shops, addresses).

If we have to take into account missing or multiple addresses under some shop names, we can just split the response into chunks of shop entries and look for the address under each shop name separately:

addresses = [
    re.findall(address_pattern, chunk) 
    for chunk in url.split("<h2>")[1:]
]

Now we have a list of lists, each holding the addresses (possibly several, possibly none) found between two "<h2>" tags. zip(shops, addresses) then gives an iterator of tuples whose first element is the shop name and whose second element is a (potentially empty) list of addresses.
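As a rough illustration of the chunking idea, here is a minimal sketch that runs on inline sample markup instead of the live page. The shop names, the second and third addresses, and the `name_pattern` helper are made up for the example. It also assumes the `<br />` spelling from the question, which the real response may not use.

```python
import re

# Sample markup modelled on the structure in the question.
# Shop names and most addresses are placeholders, not data from the real page.
html = """
<div class="post_content entry-content" itemprop="articleBody">
<p>intro</p>
<h2>1. SHOP A</h2>
<p><strong>Address</strong>: Dhoby Ghaut 238889<br />
<strong>Prices: </strong>Starting from SGD3.50 <br /></p>
<h2>2. SHOP B</h2>
<p>No address listed here.</p>
<h2>3. SHOP C</h2>
<p><strong>Address</strong>: First Street 111111<br /></p>
<p><strong>Address</strong>: Second Street 222222<br /></p>
</div>
"""

address_pattern = r"<strong>Address</strong>:(.+?)<br />"
name_pattern = r"(.*?)</h2>"  # hypothetical helper: heading text at the start of a chunk

results = {}
# Everything before the first <h2> is not a shop entry, so skip chunk 0.
for chunk in html.split("<h2>")[1:]:
    name = re.match(name_pattern, chunk).group(1).strip()
    results[name] = [a.strip() for a in re.findall(address_pattern, chunk)]

print(results)
```

Taking the name from each chunk keeps every shop tied to its own list of addresses, so a shop without an address gets an empty list and the later entries do not shift.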

Upvotes: 1

shaik moeed

Reputation: 5785

To get the address, you can use logic similar to the following:

>>> for row in soup.find_all('div', {'class':'post_content entry-content'}):
...     for el in row.find_all('p'):
...         if 'Address' in el.get_text():
...             print(el.get_text().split('\n')[0])
...             break  # remove break in your actual code.


Address: Dhoby Ghaut MRT, 60 Orchard Road, #B2-06, Dhoby Ghaut 238889
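The loop above prints address lines but does not tie them to shop names. One possible way to pair them is sketched below on inline sample markup: walk the headings and paragraphs in document order and attach the first address found to the most recent heading. The shop names and the `results` dictionary are placeholders for illustration, and the real page may need extra cleanup, for example entries that link to a website instead of giving an address.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the structure in the question (shop names are placeholders).
html = """
<div class="post_content entry-content" itemprop="articleBody">
<p>intro</p>
<h2>1. SHOP A</h2>
<p>description</p>
<p><strong>Address</strong>: Dhoby Ghaut 238889<br />
<strong>Prices: </strong>Starting from SGD3.50 <br /></p>
<h2>2. SHOP B</h2>
<p>No address listed here.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="post_content")

results = {}
current = None  # name of the shop whose section we are currently inside
for el in container.find_all(["h2", "p"]):
    if el.name == "h2":
        current = el.get_text(strip=True)
        results[current] = None  # stays None if no address is found
    elif current is not None and results[current] is None:
        label = el.find("strong", string="Address")
        if label is not None and isinstance(label.next_sibling, str):
            # The address is the text node right after the <strong> label.
            results[current] = label.next_sibling.lstrip(":").strip()

print(results)
```

Shops without an address keep the value None, so the name and address lists cannot drift out of alignment the way two independently collected lists can.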

Upvotes: 0
