Goutam
Goutam

Reputation: 423

How to scrape address (comma separated text) using Beautifulsoup in python

I am trying to scrape address from the below link:

https://www.yelp.com/biz/rollin-phatties-houston

But I am getting only the first value of the address (i.e.: 1731 Westheimer Rd) out of complete address which is separated by a comma:

1731 Westheimer Rd, Houston, TX 77098

Can anyone help me out in this, please find my code below:

import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

mains = soup.find_all("div", {"class": "secondaryAttributes__09f24__3db5x arrange-unit__09f24__1gZC1 border-color--default__09f24__R1nRO"})
main = mains[0] #First item of mains

address = []
for main in mains:
    try:       
        address.append(main.address.find("p").text)
    except:
        address.append("")

print(address)
# 1731 Westheimer Rd

Upvotes: 0

Views: 580

Answers (3)

Mohit Natani
Mohit Natani

Reputation: 52

The business address that is shown on the webpage is generated dynamically. If you view Page Source of the URL, you will find that the address of the restaurant is stored in a script element. So you need to extract the address from it.

from bs4 import BeautifulSoup
import requests
import json
page = requests.get('https://www.yelp.com/biz/rollin-phatties-houston')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script', attrs={'type':'application/json'})
scriptcontent = scriptelements[2].text
scriptcontent = scriptcontent.replace('<!--', '')
scriptcontent = scriptcontent.replace('-->', '')
jsondata = json.loads(scriptcontent)
print(jsondata['bizDetailsPageProps']['bizContactInfoProps']['businessAddress'])

Using the above code, you will be able to extract the address of any business.

Upvotes: 1

import requests
import re
from ast import literal_eval


def main(url):
    r = requests.get(url)
    match = literal_eval(
        re.search(r'addressLines.+?(\[.+?])', r.text).group(1))
    print(*match)


main('https://www.yelp.com/biz/rollin-phatties-houston')

Output:

1731 Westheimer Rd Houston, TX 77098

Upvotes: 2

Jerry An
Jerry An

Reputation: 1432

There is no need to find the address information by inspecting the element, actually, the data inside a javascript tag element is passed onto the page already. You can get it by the following code

import chompjs
import bs4 as bs
import urllib.request as url

source = url.urlopen('https://www.yelp.com/biz/rollin-phatties-houston')
soup = bs.BeautifulSoup(source, 'html.parser')

javascript = soup.select("script")[16].string
data = chompjs.parse_js_object(javascript)
data['bizDetailsPageProps']['bizContactInfoProps']['businessAddress']

Upvotes: 1

Related Questions