Ade Guntoro

Reputation: 99

Can't scrape website using Beautiful Soup

I followed a tutorial on scraping websites with Python and BeautifulSoup. I am trying to scrape a website run by my government (for research purposes), but it gives me this error:

Traceback (most recent call last):
  File "C:/Python27/scrap web.py", line 8, in <module>
    name = name_box.text.strip()
AttributeError: 'NoneType' object has no attribute 'text'

I tried another website and it worked. When I open my government's website and use "View Page Source", I don't see markup such as <table id="tableLeftBottom">. So how can I scrape data from this website?

import urllib2
from bs4 import BeautifulSoup
quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")

name_box = soup.find("table", attrs={"id": "tableRightBottom"})
name = name_box.text.strip()
print name

Upvotes: 0

Views: 3169

Answers (3)

thouger

Reputation: 435

You want to find the node with tag name "th" and id "th2b", but that content is created by JavaScript. When you open the site, you can see it loading before the table appears. So you should use a headless browser:

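import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--hide-scrollbars")
chrome_options.add_argument("--dns-prefetch-disable")
chrome_options.add_argument("--disable-extensions")
chrome_options.binary_location = "your chrome path"  # placeholder for the Chrome binary path
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()
driver.get(quote_page)  # get() returns None; the rendered HTML is in driver.page_source
time.sleep(10)  # give the JavaScript time to render the table
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, "html.parser")

name_box = soup.find("th", attrs={"id": "th2b"})
name = name_box.text.strip()
print(name)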

You will then get the text.
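
The fixed ten-second sleep either waits longer than needed or not long enough on a slow connection. One alternative is to poll until the content shows up. The following is a minimal sketch of such a polling helper using only the standard library; `wait_until` and `page_is_ready` are hypothetical names, not part of Selenium (Selenium ships its own `WebDriverWait` class for the same purpose).

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for a page that finishes rendering on the third check.
checks = []


def page_is_ready():
    checks.append(1)
    return len(checks) >= 3


print(wait_until(page_is_ready, timeout=2.0, interval=0.01))  # True
```

With a real driver, the predicate could be something like `lambda: "th2b" in driver.page_source`, assuming the id appears in the markup only after rendering.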

Upvotes: 0

SIM

Reputation: 22440

To get the data from that page, you need to make a POST request to this URL along with the necessary parameters, or you can use a browser simulator. The first option is the easier one. Here is how you can fetch the data using a POST request:

import requests
from bs4 import BeautifulSoup

URL = "https://bps.go.id/mod/Layout/variabelView.php"
payload = "valueDataSelect=**98+--+189+--+102+--+63+--+9818910263+--+1**98+--+190+--+102+--+63+--+9819010263+--+1**98+--+191+--+102+--+63+--+9819110263+--+1**98+--+189+--+105+--+63+--+9818910563+--+1**98+--+190+--+105+--+63+--+9819010563+--+1**98+--+191+--+105+--+63+--+9819110563+--+1**98+--+189+--+107+--+63+--+9818910763+--+1**98+--+190+--+107+--+63+--+9819010763+--+1**98+--+191+--+107+--+63+--+9819110763+--+1**98+--+189+--+108+--+63+--+9818910863+--+1**98+--+190+--+108+--+63+--+9819010863+--+1**98+--+191+--+108+--+63+--+9819110863+--+1**98+--+189+--+109+--+61+--+9818910961+--+1**98+--+190+--+109+--+61+--+9819010961+--+1**98+--+191+--+109+--+61+--+9819110961+--+1**98+--+189+--+110+--+61+--+9818911061+--+1**98+--+190+--+110+--+61+--+9819011061+--+1**98+--+191+--+110+--+61+--+9819111061+--+1**98+--+189+--+111+--+61+--+9818911161+--+1**98+--+189+--+111+--+62+--+9818911162+--+1**98+--+190+--+111+--+61+--+9819011161+--+1**98+--+190+--+111+--+62+--+9819011162+--+1**98+--+191+--+111+--+61+--+9819111161+--+1**98+--+191+--+111+--+62+--+9819111162+--+1**98+--+189+--+112+--+61+--+9818911261+--+1**98+--+189+--+112+--+62+--+9818911262+--+1**98+--+190+--+112+--+61+--+9819011261+--+1**98+--+190+--+112+--+62+--+9819011262+--+1**98+--+191+--+112+--+61+--+9819111261+--+1**98+--+191+--+112+--+62+--+9819111262+--+1**98+--+189+--+113+--+61+--+9818911361+--+1**98+--+189+--+113+--+62+--+9818911362+--+1**98+--+190+--+113+--+61+--+9819011361+--+1**98+--+190+--+113+--+62+--+9819011362+--+1**98+--+191+--+113+--+61+--+9819111361+--+1**98+--+191+--+113+--+62+--+9819111362+--+1**98+--+189+--+114+--+61+--+9818911461+--+1**98+--+189+--+114+--+62+--+9818911462+--+1**98+--+190+--+114+--+61+--+9819011461+--+1**98+--+190+--+114+--+62+--+9819011462+--+1**98+--+191+--+114+--+61+--+9819111461+--+1**98+--+191+--+114+--+62+--+9819111462+--+1**98+--+189+--+115+--+61+--+9818911561+--+1**98+--+189+--+115+--+62+--+9818911562+--+1**98+--+190+--+115+--+61+--+9819011561+--+1**98+--+190+--+115+--+62+--+9819011562
+--+1**98+--+191+--+115+--+61+--+9819111561+--+1**98+--+191+--+115+--+62+--+9819111562+--+1**98+--+189+--+116+--+61+--+9818911661+--+1**98+--+189+--+116+--+62+--+9818911662+--+1**98+--+190+--+116+--+61+--+9819011661+--+1**98+--+190+--+116+--+62+--+9819011662+--+1**98+--+191+--+116+--+61+--+9819111661+--+1**98+--+191+--+116+--+62+--+9819111662+--+1**98+--+189+--+117+--+61+--+9818911761+--+1**98+--+189+--+117+--+62+--+9818911762+--+1**98+--+190+--+117+--+61+--+9819011761+--+1**98+--+190+--+117+--+62+--+9819011762+--+1**98+--+191+--+117+--+61+--+9819111761+--+1**98+--+191+--+117+--+62+--+9819111762+--+1&wilayahDataSelect=1%23%23~2~3~4~5~6~7~8~9~10~11~12~13~14~15~16~17~18~19~20~21~22~23~24~25~26~27~28~29~30~31~32~33~34~35~1%40%40%24%24%24%40%40&keteranganDataSelect=**Gini+Rasio++--+Perkotaan+--+2002+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2005+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2007+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2008+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+-
-+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+1
+(Maret)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+2+(September)&kirim=3&layout=Var"

with requests.Session() as s:
    s.headers={"User-Agent":"Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
    html = s.post(URL, data = payload).text
    soup = BeautifulSoup(html, "lxml")
    for items in soup.find(id="tableRightBottom").find_all("tr"):
        data = [item.text for item in items.find_all("td")]
        print(data)

Output:

[' - ', '0.332', '0.289', '0.301', '0.291', '0.312', '0.353', '0.370', '0.337', '0.407', '0.404', '0.382', '0.358', '0.380', '0.367', '0.368', '0.343', '0.362', '0.347', '0.334', ' - ', '0.239', '0.257', '0.253', '0.250', '0.261', '0.280', '0.269', '0.271', '0.260', '0.256', '0.254', '0.259', '0.277', '0.292', '0.293', '0.288', '0.296', '0.293', '0.299', ' - ', '0.288', '0.285', '0.290', '0.288', '0.301', '0.326', '0.326', '0.320', '0.341', '0.341', '0.331', '0.325', '0.337', '0.334', '0.339', '0.333', '0.341', '0.329', '0.329']

and so on.
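
The payload is an ordinary URL-encoded form body, so the standard library can decode it for inspection or build a new one from a dictionary. The sketch below uses a shortened sample with one `keteranganDataSelect` entry copied from the payload; the shortened string is for illustration only, and it is an assumption that the server would accept a body rebuilt this way.

```python
from urllib.parse import parse_qs, urlencode

# Shortened sample in the same format as the payload (illustration only).
sample = (
    "keteranganDataSelect=**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2002+--+Tahunan"
    "&kirim=3&layout=Var"
)

# Decode: "+" becomes a space and "%2B" becomes a literal "+".
fields = {key: values[0] for key, values in parse_qs(sample).items()}
for key, value in fields.items():
    print(key, "=", repr(value))

# Encode the dictionary back into a form body that can be sent as `data`.
rebuilt = urlencode(fields)
print(parse_qs(rebuilt) == parse_qs(sample))  # True
```

Decoding the body this way makes it easier to see which years and regions a request asks for. Note that `requests` also accepts a dictionary directly as `data` and form-encodes it itself.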

Upvotes: 1

R.yan

Reputation: 2382

This happens because the website's HTML does not contain the data. The data is rendered by JavaScript inside the div with id dataDynamic, and it comes from the endpoint https://bps.go.id/mod/Layout/variabelView.php.

If you want to get the data, you can use either selenium or requests_html.
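
Before switching tools, it can help to confirm that the element really is missing from the HTML the server sends. The sketch below does that with the standard library's `html.parser`; `IdCollector`, `has_element_id`, and the sample HTML string are made up for illustration, and the sample only stands in for the real response.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html, element_id):
    """Return True if any tag in `html` carries the given id."""
    collector = IdCollector()
    collector.feed(html)
    return element_id in collector.ids


# Hypothetical stand-in for the HTML returned before any JavaScript runs.
static_html = '<html><body><div id="dataDynamic">Loading...</div></body></html>'

print(has_element_id(static_html, "dataDynamic"))       # True
print(has_element_id(static_html, "tableRightBottom"))  # False
```

If the id is absent from the raw response, `soup.find(...)` returns `None`, which is what leads to the `AttributeError` in the question.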

Upvotes: 0
