Reputation: 51
I'm trying to scrape the shop names and their corresponding addresses from a web page that has this HTML structure:
<div class="post_content entry-content" itemprop="articleBody">
<p>...</p>
<p>...</p>
<h2>1. SHOP NAME</h2>
<p>...</p>
<p>...</p>
<p><strong>Address</strong>: Dhoby Ghaut 238889<br />
<strong>Prices: </strong>Starting from SGD3.50 <br />
<strong>Websites</strong>:<a href="https://..." target="_blank" rel="noopener"></a></p>
<h2>2. SHOP NAME</h2>
.
.
<h2>3. SHOP NAME</h2>
.
.
</div>
The individual shop sections do not use classes, and I'm having trouble extracting the addresses. Does anyone know how to do this?
This is my code for getting the Shop Name:
url = requests.get('https://avenueone.sg/recipes-food/bubble-tea-brands-singapore/').text
shop = []
address = []
soup = BeautifulSoup(url, 'lxml')
for row in soup.find_all("h2"):
    shop.append(row.text)
    for line in row.find_all(string='Address'):
        address.append(line.text)

import re
for i in soup.find('div', class_='post_content entry-content'):
    for x in soup.find_all(re.compile("^Address")):
        address.append(line.text)
I'm able to retrieve the list of shop names into a dataframe, but not their corresponding addresses. Can anyone help me out with this?
Upvotes: 1
Views: 157
Reputation: 84465
You can use the following CSS selectors and regex. The regex is only there to check whether the address text actually points to a website URL and, if so, to retrieve that URL instead. This requires bs4 4.7.1+ because I use :contains to target the Address strong tags.
from bs4 import BeautifulSoup as bs
import requests, re
r = requests.get('https://avenueone.sg/recipes-food/bubble-tea-brands-singapore/')
soup = bs(r.content, 'lxml')
names = [i.text.replace('\xa0',' ') for i in soup.select('.post_content p + h2')]
addresses = [
    i.next_sibling.replace('\xa0','').replace(':','').strip()
    if not re.search(r'See this|See their', i.next_sibling)
    else i.parent.a['href']
    for i in soup.select('strong:contains("Address")')
]
results = dict(zip(names,addresses))
print(results)
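The per-item logic inside the `addresses` comprehension above can be pulled out into a small helper, which may be easier to read and to test without hitting the site. This is only a sketch: the function name `clean_address` and its `fallback_href` parameter are hypothetical, and the sample strings are made up to mirror the structure in the question.

```python
import re


def clean_address(raw_text, fallback_href=None):
    """Normalise the text node that follows an Address label.

    If the text is only a pointer such as "See this ..." or "See their ...",
    return the link from the same paragraph instead (fallback_href).
    """
    if re.search(r"See this|See their", raw_text):
        return fallback_href
    # Same cleanup as the comprehension: drop non-breaking spaces and colons.
    return raw_text.replace("\xa0", "").replace(":", "").strip()


# Made-up inputs for illustration only.
plain = clean_address(": Dhoby Ghaut 238889")
linked = clean_address(": See their website", "https://example.com/outlets")
```

With bs4 this would be called as `clean_address(i.next_sibling, i.parent.a['href'])` for each matched strong tag, assuming every such paragraph contains a link as it does in the original one-liner.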
Upvotes: 1
Reputation: 6184
Since there are no classes, I would not use BeautifulSoup and would instead fall back to regular expressions to find the addresses in the response. If the formatting is stable and matches what you described in your question, we could use the following regex:
import re
address_pattern = "<strong>Address</strong>:(.+?)<br />"
addresses = re.findall(address_pattern, url)
We still need to relate the addresses to the shop names, but how that should be done depends on some assumptions you have not given. If there is exactly one address per shop, and the shop names are stored in a variable shops, we can just zip(shops, addresses).
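As a rough end-to-end sketch of the one-address-per-shop case, the snippet below runs the same address regex on a small, made-up HTML string instead of the live page. The sample markup, the second shop's address, and the `NAME_PATTERN` regex for the shop names are assumptions added for illustration.

```python
import re

# Hypothetical sample that mirrors the structure described in the question.
SAMPLE_HTML = (
    "<h2>1. SHOP A</h2>\n"
    "<p><strong>Address</strong>: Dhoby Ghaut 238889<br />\n"
    "<strong>Prices: </strong>Starting from SGD3.50 <br /></p>\n"
    "<h2>2. SHOP B</h2>\n"
    "<p><strong>Address</strong>: 1 Example Road 123456<br /></p>\n"
)

ADDRESS_PATTERN = r"<strong>Address</strong>:(.+?)<br />"
NAME_PATTERN = r"<h2>(.+?)</h2>"  # assumption: shop names are plain text in <h2>

shops = re.findall(NAME_PATTERN, SAMPLE_HTML)
# Strip the leading space left behind after the colon.
addresses = [a.strip() for a in re.findall(ADDRESS_PATTERN, SAMPLE_HTML)]

# Only valid if every shop has exactly one address, in document order.
pairs = list(zip(shops, addresses))
print(pairs)
```

Note that zip silently truncates to the shorter list, so a single missing address would shift every later pairing.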
If we have to take into account missing or multiple addresses under some shop names, we can just split the response into chunks of shop entries and look for the address under each shop name separately:
addresses = [
    re.findall(address_pattern, chunk)
    for chunk in url.split("<h2>")[1:]
]
Now we have a list of lists, one per chunk between two "<h2>" tags, each holding zero, one, or several addresses. zip(shops, addresses) will now give us an iterator of tuples, where the first element is the shop name and the second element is a (potentially empty) list of addresses.
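To make the chunking idea concrete, here is a minimal sketch on a made-up HTML string with one shop that has no address and one that has two. The helper name `addresses_per_shop` and the sample data are hypothetical; it also reads the shop name out of each chunk, which is a small variation on the zip approach rather than the exact code above.

```python
import re

# Hypothetical sample: SHOP B has no address, SHOP C has two.
SAMPLE_HTML = (
    "<p>Intro paragraph</p>\n"
    "<h2>1. SHOP A</h2>\n"
    "<p><strong>Address</strong>: Dhoby Ghaut 238889<br />\n"
    "<strong>Prices: </strong>Starting from SGD3.50 <br /></p>\n"
    "<h2>2. SHOP B</h2>\n"
    "<p>No address given.</p>\n"
    "<h2>3. SHOP C</h2>\n"
    "<p><strong>Address</strong>: 1 Example Road 123456<br /></p>\n"
    "<p><strong>Address</strong>: 2 Example Road 654321<br /></p>\n"
)

ADDRESS_PATTERN = r"<strong>Address</strong>:(.+?)<br />"


def addresses_per_shop(html):
    """Map each shop name to the list of addresses found under its <h2>."""
    result = {}
    # Everything before the first <h2> is not a shop entry, hence [1:].
    for chunk in html.split("<h2>")[1:]:
        name, _, body = chunk.partition("</h2>")
        found = re.findall(ADDRESS_PATTERN, body)
        result[name.strip()] = [a.strip() for a in found]
    return result


by_shop = addresses_per_shop(SAMPLE_HTML)
print(by_shop)
```

Taking the name from the same chunk as the addresses keeps the two aligned even if the separately scraped list of shop names were to differ in length.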
Upvotes: 1
Reputation: 5785
To get the address, you can use logic similar to the following:
>>> for row in soup.find_all('div', {'class':'post_content entry-content'}):
...     for el in row.find_all('p'):
...         if 'Address' in el.get_text():
...             print(el.get_text().split('\n')[0])
...             break  # remove break in your actual code.
...
Address: Dhoby Ghaut MRT, 60 Orchard Road, #B2-06, Dhoby Ghaut 238889
Upvotes: 0