Reputation: 99
I am following a tutorial (from here) about scraping websites with Python and BeautifulSoup. I am trying to scrape a website of my government (for research purposes), but it gives me this error:
Traceback (most recent call last):
  File "C:/Python27/scrap web.py", line 8, in <module>
    name = name_box.text.strip()
AttributeError: 'NoneType' object has no attribute 'text'
I tried another website (like this) and it works. But when I open my government's website and use "View Page Source", I do not see code such as <table id="tableLeftBottom">. So how can I scrape data from this website?
import urllib2
from bs4 import BeautifulSoup
quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
name_box = soup.find("table", attrs={"id": "tableRightBottom"})
name = name_box.text.strip()
print name
Upvotes: 0
Views: 3169
Reputation: 435
You are looking for the node with tag "th" and id "th2b", but that content is created by JavaScript. When you open the site, you can see it loading. So you should use a headless browser:
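import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

quote_page = "https://bps.go.id/linkTableDinamis/view/id/1116"

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--disable-infobars")
chrome_options.add_argument("--hide-scrollbars")
chrome_options.add_argument("--dns-prefetch-disable")
chrome_options.add_argument("--disable-extensions")
chrome_options.binary_location = "your chrome path"  # placeholder: path to the Chrome binary
# Older Selenium releases used the keyword chrome_options= instead of options=
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()
driver.get(quote_page)  # get() returns None; the page lives in the driver
time.sleep(10)  # give the JavaScript time to build the table
page = driver.page_source  # HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(page, "html.parser")
name_box = soup.find("th", attrs={"id": "th2b"})
name = name_box.text.strip()
print(name)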
You will get the text.
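A fixed time.sleep(10) either waits too long or not long enough. As a rough sketch (not part of the answer above), a small polling helper can stop waiting as soon as the content shows up. The name wait_until and its parameters are made up for illustration, and the demo uses a fake check instead of a real browser:

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Demo with a fake check that only succeeds on the third call,
# standing in for "has the table been rendered yet?".
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return "ready" if calls["n"] >= 3 else None


print(wait_until(ready_on_third_call, timeout=2.0, interval=0.01))
```

With the driver from the answer, the predicate could be something like lambda: driver.find_elements(By.ID, "th2b"), with By imported from selenium.webdriver.common.by. Selenium also ships its own WebDriverWait for this purpose, which is probably the better choice in real code.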
Upvotes: 0
Reputation: 22440
To get the data from that page, you need to make a POST request to the URL used below, along with the necessary parameters, or you can try a browser simulator. However, the first option is the easier one. Here is how you can fetch the data using a POST request:
import requests
from bs4 import BeautifulSoup
URL = "https://bps.go.id/mod/Layout/variabelView.php"
payload = "valueDataSelect=**98+--+189+--+102+--+63+--+9818910263+--+1**98+--+190+--+102+--+63+--+9819010263+--+1**98+--+191+--+102+--+63+--+9819110263+--+1**98+--+189+--+105+--+63+--+9818910563+--+1**98+--+190+--+105+--+63+--+9819010563+--+1**98+--+191+--+105+--+63+--+9819110563+--+1**98+--+189+--+107+--+63+--+9818910763+--+1**98+--+190+--+107+--+63+--+9819010763+--+1**98+--+191+--+107+--+63+--+9819110763+--+1**98+--+189+--+108+--+63+--+9818910863+--+1**98+--+190+--+108+--+63+--+9819010863+--+1**98+--+191+--+108+--+63+--+9819110863+--+1**98+--+189+--+109+--+61+--+9818910961+--+1**98+--+190+--+109+--+61+--+9819010961+--+1**98+--+191+--+109+--+61+--+9819110961+--+1**98+--+189+--+110+--+61+--+9818911061+--+1**98+--+190+--+110+--+61+--+9819011061+--+1**98+--+191+--+110+--+61+--+9819111061+--+1**98+--+189+--+111+--+61+--+9818911161+--+1**98+--+189+--+111+--+62+--+9818911162+--+1**98+--+190+--+111+--+61+--+9819011161+--+1**98+--+190+--+111+--+62+--+9819011162+--+1**98+--+191+--+111+--+61+--+9819111161+--+1**98+--+191+--+111+--+62+--+9819111162+--+1**98+--+189+--+112+--+61+--+9818911261+--+1**98+--+189+--+112+--+62+--+9818911262+--+1**98+--+190+--+112+--+61+--+9819011261+--+1**98+--+190+--+112+--+62+--+9819011262+--+1**98+--+191+--+112+--+61+--+9819111261+--+1**98+--+191+--+112+--+62+--+9819111262+--+1**98+--+189+--+113+--+61+--+9818911361+--+1**98+--+189+--+113+--+62+--+9818911362+--+1**98+--+190+--+113+--+61+--+9819011361+--+1**98+--+190+--+113+--+62+--+9819011362+--+1**98+--+191+--+113+--+61+--+9819111361+--+1**98+--+191+--+113+--+62+--+9819111362+--+1**98+--+189+--+114+--+61+--+9818911461+--+1**98+--+189+--+114+--+62+--+9818911462+--+1**98+--+190+--+114+--+61+--+9819011461+--+1**98+--+190+--+114+--+62+--+9819011462+--+1**98+--+191+--+114+--+61+--+9819111461+--+1**98+--+191+--+114+--+62+--+9819111462+--+1**98+--+189+--+115+--+61+--+9818911561+--+1**98+--+189+--+115+--+62+--+9818911562+--+1**98+--+190+--+115+--+61+--+9819011561+--+1**98+--+190+--+115+--+62+--+9819011562
+--+1**98+--+191+--+115+--+61+--+9819111561+--+1**98+--+191+--+115+--+62+--+9819111562+--+1**98+--+189+--+116+--+61+--+9818911661+--+1**98+--+189+--+116+--+62+--+9818911662+--+1**98+--+190+--+116+--+61+--+9819011661+--+1**98+--+190+--+116+--+62+--+9819011662+--+1**98+--+191+--+116+--+61+--+9819111661+--+1**98+--+191+--+116+--+62+--+9819111662+--+1**98+--+189+--+117+--+61+--+9818911761+--+1**98+--+189+--+117+--+62+--+9818911762+--+1**98+--+190+--+117+--+61+--+9819011761+--+1**98+--+190+--+117+--+62+--+9819011762+--+1**98+--+191+--+117+--+61+--+9819111761+--+1**98+--+191+--+117+--+62+--+9819111762+--+1&wilayahDataSelect=1%23%23~2~3~4~5~6~7~8~9~10~11~12~13~14~15~16~17~18~19~20~21~22~23~24~25~26~27~28~29~30~31~32~33~34~35~1%40%40%24%24%24%40%40&keteranganDataSelect=**Gini+Rasio++--+Perkotaan+--+2002+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2002+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2005+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2005+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2007+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2007+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2008+--+Tahunan**Gini+Rasio++--+Perdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2008+--+Tahunan**Gini+Rasio++--+Perkotaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2009+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2010+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+-
-+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2011+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2012+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2013+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2014+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2015+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2016+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+1
+(Maret)**Gini+Rasio++--+Perkotaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perdesaan+--+2017+--+Semester+2+(September)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+1+(Maret)**Gini+Rasio++--+Perkotaan%2BPerdesaan+--+2017+--+Semester+2+(September)&kirim=3&layout=Var"
with requests.Session() as s:
    s.headers = {"User-Agent": "Mozilla/5.0"}
    s.headers.update({'Content-Type': 'application/x-www-form-urlencoded'})
    html = s.post(URL, data=payload).text
    soup = BeautifulSoup(html, "lxml")
    for items in soup.find(id="tableRightBottom").find_all("tr"):
        data = [item.text for item in items.find_all("td")]
        print(data)
Output:
[' - ', '0.332', '0.289', '0.301', '0.291', '0.312', '0.353', '0.370', '0.337', '0.407', '0.404', '0.382', '0.358', '0.380', '0.367', '0.368', '0.343', '0.362', '0.347', '0.334', ' - ', '0.239', '0.257', '0.253', '0.250', '0.261', '0.280', '0.269', '0.271', '0.260', '0.256', '0.254', '0.259', '0.277', '0.292', '0.293', '0.288', '0.296', '0.293', '0.299', ' - ', '0.288', '0.285', '0.290', '0.288', '0.301', '0.326', '0.326', '0.320', '0.341', '0.341', '0.331', '0.325', '0.337', '0.334', '0.339', '0.333', '0.341', '0.329', '0.329']
and so on ----
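The long payload string in the code above is ordinary form-encoded data, so it can be decoded to see which fields the page sends. The sketch below works on a shortened fragment (only the first entry of two fields) so it stays readable. The helper names decode_payload and split_entries are made up for illustration, and what the individual numbers mean is not stated in this thread:

```python
from urllib.parse import parse_qs

# Shortened fragment of the payload (first entry of each list only).
# The real request needs the full string from the answer.
fragment = (
    "valueDataSelect=**98+--+189+--+102+--+63+--+9818910263+--+1"
    "&keteranganDataSelect=**Gini+Rasio++--+Perkotaan+--+2002+--+Tahunan"
    "&kirim=3&layout=Var"
)


def decode_payload(raw):
    """Decode a form-encoded string into a {field: value} dict."""
    fields = parse_qs(raw)
    # Each field appears once here, so keep only the first value.
    return {name: values[0] for name, values in fields.items()}


def split_entries(value):
    """Split a '**a -- b -- c**d -- e -- f' value into lists of parts."""
    return [
        [part.strip() for part in entry.split(" -- ")]
        for entry in value.split("**")
        if entry
    ]


decoded = decode_payload(fragment)
for name, value in decoded.items():
    print(name, split_entries(value) if "**" in value else value)
```

Going the other way, urllib.parse.urlencode can build such a string from a dict, or the dict can be passed directly as data= to requests, which encodes it itself. The exact escaping may differ from the hand-copied string while decoding to the same values.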
Upvotes: 1
Reputation: 2382
This is because the website's HTML does not contain the data. The data is rendered by JavaScript inside the div with id dataDynamic. The data comes from the endpoint https://bps.go.id/mod/Layout/variabelView.php.
If you want to get the data, you can use either selenium or requests_html.
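A quick way to confirm this kind of problem is to check which ids actually exist in the HTML the server returns, before any JavaScript runs. The sketch below uses only the standard library. The two HTML strings are made-up stand-ins for the raw and the rendered page, not copies of the real site:

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect every id attribute seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def ids_in_html(html):
    """Return the set of element ids present in the given HTML string."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.ids


# Hypothetical stand-ins: what the server sends vs. what the browser shows
# after the JavaScript has filled in the table.
static_html = '<html><body><div id="dataDynamic">Loading...</div></body></html>'
rendered_html = (
    '<html><body><div id="dataDynamic">'
    '<table id="tableRightBottom"><tr><td>0.332</td></tr></table>'
    "</div></body></html>"
)

print("tableRightBottom" in ids_in_html(static_html))    # False
print("tableRightBottom" in ids_in_html(rendered_html))  # True
```

If the id is missing from the raw response, soup.find() returns None, which is exactly what leads to the AttributeError in the question.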
Upvotes: 0