Reputation: 127
I'm new to scraping, and am learning to use BeautifulSoup but I'm having trouble scraping a table. For the HTML I'm trying to parse:
<table id="ctl00_mainContent_DataList1" cellspacing="0" > style="width:80%;border-collapse:collapse;"> == $0
<tbody>
<tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
<tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
<tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
<tr><td><table width="90%" cellpadding="5" cellspacing="0">...</table></td></tr>
...
My code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
quote_page = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
page = urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table', id="ctl00_mainContent_DataList1")
rows = table.findAll('tr')
I get AttributeError: 'NoneType' object has no attribute 'findAll'
. I'm using python 3.6 and jupyter notebook for this in case that matters.
EDIT:
The table data that I'm trying to parse only shows up on the page after requesting a search (In the city
field, select Burnaby
, and hit search). The table ctl00_mainContent_DataList1
is the list of dentists that shows up after the search is submitted.
Upvotes: 1
Views: 6114
Reputation: 143216
First: I use requests
because it is easier to work with cookies, headers, etc.
Page is generated by ASP.net
and it sends values __VIEWSTATE
, __VIEWSTATEGENERATOR
, __EVENTVALIDATION
which you have to send in POST
request too.
You have to load page using GET
and then you can get those values.
You can also use request.Session()
to get cookies which can be needed too.
Next you have to copy values and add parameters from form and send it using POST
.
In code I put only parameters which are always send.
'526'
is code for Vancouver
. Other codes you can find in <select>
tag.
If you want other options then you may have to add other parameters.
ie. ctl00$mainContent$chkUndr4Ref: on
is for Children: 3 & Under - Diagnose & Refer
EDIT: because inside <tr>
is <table>
so find_all('tr')
returns too many elements (external tr
and internal tr
) and and later
find_all('td')give the same
tdmany times. I changed
find_all('tr')into
find_all('table')` and it should stop duplicate data.
import requests
from bs4 import BeautifulSoup
url = 'https://www.bcdental.org/yourdentalhealth/findadentist.aspx'
# --- session ---
s = requests.Session() # to automatically copy cookies
#s.headers.update({'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0'})
# --- GET request ---
# get page to get cookies and params
response = s.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# --- set params ---
params = {
# session - copy from GET request
#'EktronClientManager': '',
#'__VIEWSTATE': '',
#'__VIEWSTATEGENERATOR': '',
#'__EVENTVALIDATION': '',
# main options
'ctl00$terms': '',
'ctl00$mainContent$drpCity': '526',
'ctl00$mainContent$txtPostalCode': '',
'ctl00$mainContent$drpSpecialty': 'GP',
'ctl00$mainContent$drpLanguage': '0',
'ctl00$mainContent$drpSedation': '0',
'ctl00$mainContent$btnSearch': '+Search+',
# other options
#'ctl00$mainContent$chkUndr4Ref': 'on',
}
# copy from GET request
for key in ['EktronClientManager', '__VIEWSTATE', '__VIEWSTATEGENERATOR', '__EVENTVALIDATION']:
value = soup.find('input', id=key)['value']
params[key] = value
#print(key, ':', value)
# --- POST request ---
# get page with table - using params
response = s.post(url, data=params)#, headers={'Referer': url})
soup = BeautifulSoup(response.text, 'html.parser')
# --- data ---
table = soup.find('table', id='ctl00_mainContent_DataList1')
if not table:
print('no table')
#table = soup.find_all('table')
#print('count:', len(table))
#print(response.text)
else:
for row in table.find_all('table'):
for column in row.find_all('td'):
text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
print(text)
print('-----')
Part of result:
Map
Dr. Kashyap Vora, 6145 Fraser Street, Vancouver V5W 2Z9
604 321 1869, www.voradental.ca
-----
Map
Dr. Niloufar Shirzad, Harbour Centre DentalL19 - 555 Hastings Street West, Vancouver V6B 4N6
604 669 1195, www.harbourcentredental.com
-----
Map
Dr. Janice Brennan, 902 - 805 Broadway West, Vancouver V5Z 1K1
604 872 2525
-----
Map
Dr. Rosemary Chang, 1240 Kingsway, Vancouver V5V 3E1
604 873 1211
-----
Map
Dr. Mersedeh Shahabaldine, 3641 Broadway West, Vancouver V6R 2B8
604 734 2114, www.westkitsdental.com
-----
Upvotes: 2