Reputation: 7245
I would like to extract information about architects from this website:
https://www.sia.ch/en/membership/member-directory/m/207778/
In particular, I want the name, address, telephone number, and email.
I would like to get output like the following:
person = ['Pierluigi A Marca', 'Sihlquai 244', '8005 Zürich', '+41 442734340', '[email protected]']
This is what I have tried so far, but it does not extract that information:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='content')
print(results.prettify())
This prints the following HTML, in which the contact details appear only as "needs javascript":
<div class="pagewidth clearfix" id="content">
<div class="textheader">
</div>
<ul class="headlineicon clearfix">
<li class="print">
<a href="javascript:print();">
</a>
</li>
<li class="bookmark">
<a class="addthis_button_favorites" href="javascript:;">
<span>
</span>
</a>
</li>
<li class="share">
<li class="mail_widget">
<a class="addthis_button_email">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="googleplus">
<a class="addthis_button_google_plusone_share">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="twitter">
<a class="addthis_button_twitter">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<li class="facebook">
<a class="addthis_button_facebook">
<img alt="" src="/fileadmin/templates/img/transp.gif"/>
</a>
</li>
<script type="text/javascript">
var addthis_config = { data_track_clickback: false }
</script>
</li>
</ul>
<div class="clearfix spec-height-theme">
<div class="narrowcolumnLeft">
<ul class="clearfix" id="subNavigation">
<li>
<a href="/en/membership/membership/" onfocus="blurLink(this);">
membership
</a>
<span>
</span>
</li>
<li class="active">
<a href="/en/membership/member-directory/" onfocus="blurLink(this);">
member directory
</a>
<span>
</span>
<ul>
<li>
<a href="/en/membership/member-directory/honorary-members/" onfocus="blurLink(this);">
honorary members
</a>
</li>
<li>
<a href="/en/membership/member-directory/individual-members/" onfocus="blurLink(this);">
individual members
</a>
</li>
<li>
<a href="/en/membership/member-directory/corporate-members/" onfocus="blurLink(this);">
corporate members
</a>
</li>
<li>
<a href="/en/membership/member-directory/student-members/" onfocus="blurLink(this);">
student members
</a>
</li>
<li>
<a href="/en/membership/member-directory/partner/" onfocus="blurLink(this);">
partner
</a>
</li>
</ul>
</li>
</ul>
</div>
<div class="widecolumn">
<!--TYPO3SEARCH_begin-->
<div class="csc-default" id="c303">
<div class="tx-updsiafeuseradmin-pi1">
<div class="tx-updsiafeuseradmin-pi1-singleView">
<div class="secr" data-secr="09d93fcfd5cf0f0b68e11bba96f6312c4023c72d">
</div>
<h1 class="mitgliederprofil">
Individual Member
</h1>
<table>
<tr>
<th colspan="2" valign="top">
Address
</th>
</tr>
<tr>
<td colspan="2" valign="top">
<!-- -->
<!--Dipl. Arch. ETH/SIA<br />-->
Mr
<br/>
Pierluigi A Marca
<br/>
Dipl. Arch. ETH/SIA
<br/>
Sihlquai 244
<br/>
8005 Zürich
<br/>
</td>
</tr>
<tr>
<th colspan="2" valign="top">
Contact
</th>
</tr>
<tr>
<td class="col1" valign="top">
Telephone number
<br/>
E-mail
<br/>
</td>
<td valign="top">
<div class="contact-data" data-contact="ggFeglggKF42DCpZz2iOI3EgcsZxN14vIYlhSGFLtORrpHZtgSiJ8tWDNuNxus03JD60nZu+g1FVPIdMiCp/bZMsSL45/+3xu9MMEZLnhH/Y67evbMdMICVsZaULHgIpA+S50ZdTg3glRtCa9CTX/zfXOfgyDaarW44HMYeW6pTMqImejlSubQXjCiPKzS0jgiZHBGspcnBZW/99X0ORYNaEUvOkjJDmozv9yld9A1x4jdyXAqHoDMMx0IICMsJiWcKADTFWKfI0OHHORhv7kvVW3KtbnX5PJjyilH0=">
needs javascript
</div>
</td>
</tr>
<tr>
<th colspan="2" valign="top">
Details
</th>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Profession
</td>
<td valign="top">
Diploma in Architecture
<br/>
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Area of activity
<br/>
</td>
<td valign="top">
Architecture
<br/>
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Professional group
</td>
<td valign="top">
Architecture
</td>
</tr>
<tr>
<td class="tx-updsiafeuseradmin-pi1-singleView-2cols" valign="top">
Section
</td>
<td valign="top">
Zurich
<br/>
</td>
</tr>
<tr>
<td colspan="2" valign="top">
</td>
</tr>
</table>
<!--<div class="tx-updsiafeuseradmin-pi1-singleView-footer lightbox-close-link"><a href="javascript:;">Close</a></div>-->
<div class="tx-updsiafeuseradmin-pi1-singleView-footer" style="display:none;">
<span>
</span>
<a href="javascript:history.back()">
back to results list
</a>
</div>
<script type="text/javascript">
jQuery(document).ready(function() {
if (document.referrer.split( "/" )[2] == "www.sia.ch") {
jQuery(".tx-updsiafeuseradmin-pi1-singleView-footer").show();
}
});
</script>
</div>
</div>
</div>
<!--TYPO3SEARCH_end-->
</div>
</div>
</div>
Upvotes: 0
Views: 83
Reputation: 441
You can do it without Selenium. I won't provide the decoding code (for legal reasons), but here are some notes on how to do it. The page decrypts the contact details on the client with this JavaScript (excerpt):
// init hide contact
jQuery(".contact-data").html(Aes.Ctr.decrypt(
jQuery(".contact-data").data("contact"),
jQuery(".secr").data("secr"), 256));
});
The encrypted data is at this XPath: //div[@class='contact-data']/@data-contact
The AES key is at this XPath: //div[@class='secr']/@data-secr
The key is generated on each request.
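As a rough sketch of the first step only, the two attribute values can be read from the page source without a browser. The example below uses Python's standard-library `html.parser`, so it runs without extra packages. The class name `EncryptedContactParser` and the shortened `sample_html` are made up for illustration, and the attribute values are placeholders, not real data. It does not decrypt anything.

```python
from html.parser import HTMLParser


class EncryptedContactParser(HTMLParser):
    """Collects the data-contact and data-secr attribute values (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.ciphertext = None  # value of data-contact
        self.key = None         # value of data-secr

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if 'contact-data' in classes:
            self.ciphertext = attrs.get('data-contact')
        elif 'secr' in classes:
            self.key = attrs.get('data-secr')


# Shortened stand-in for the real page; the attribute values are placeholders.
sample_html = '''
<div class="secr" data-secr="09d93fcf"></div>
<div class="contact-data" data-contact="ggFeglgg">needs javascript</div>
'''

parser = EncryptedContactParser()
parser.feed(sample_html)
print(parser.key)
print(parser.ciphertext)
```

Because the key changes on each request, both values would need to come from the same response.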
Good luck!
Upvotes: 0
Reputation: 28565
You can use Selenium to let the JavaScript render the contact details. Then you need to do a little manipulation. This gets you there, and it includes the individual's title ('Mr'):
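import re
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = 'https://www.sia.ch/en/membership/member-directory/m/207778/'
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('C:/chromedriver_win32/chromedriver.exe'))
driver.get(url)
html = driver.page_source
driver.quit()

# Mark every line break so the cell text can be split after parsing
html = re.sub(r'<br\s*/?>', '::', html)
# Row 0 holds the address block, row 2 the contact details (second column)
df = pd.read_html(StringIO(html))[0].iloc[[0, 2], 1]

contact = []
for x in df.tolist():
    # Split on the markers and drop empty pieces
    alpha = [a.strip() for a in x.split('::') if a.strip()]
    contact.append(alpha)
contact = contact[0] + contact[1]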
Output:
print (contact)
['Mr', 'Pierluigi A Marca', 'Dipl. Arch. ETH/SIA', 'Sihlquai 244', '8005 Zürich', '+41 442734340', '[email protected]']
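The name and address lines are already in the static HTML shown in the question, so that part could also be read without a browser. Here is a minimal sketch using only the standard library. The class name `FirstCellText` is made up, `sample_html` is a cut-down copy of the question's table, and it assumes the address block is the first `<td>` on the page.

```python
from html.parser import HTMLParser


class FirstCellText(HTMLParser):
    """Collects the non-empty text lines of the first <td> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.done = False
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td' and not self.done:
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_cell:
            self.in_cell = False
            self.done = True

    def handle_data(self, data):
        # Text between <br/> tags arrives as separate chunks; comments are skipped
        if self.in_cell and data.strip():
            self.lines.append(data.strip())


# Cut-down copy of the table from the question
sample_html = '''
<table>
<tr><th colspan="2">Address</th></tr>
<tr><td colspan="2">
<!--Dipl. Arch. ETH/SIA<br />-->
Mr<br/>Pierluigi A Marca<br/>Dipl. Arch. ETH/SIA<br/>Sihlquai 244<br/>8005 Zürich<br/>
</td></tr>
<tr><td class="col1">Telephone number<br/>E-mail<br/></td></tr>
</table>
'''

parser = FirstCellText()
parser.feed(sample_html)
address_lines = parser.lines
print(address_lines)
```

The telephone number and e-mail are not covered by this, since they only appear after the page's JavaScript has run.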
Upvotes: 1