Reputation: 3744
I am trying to extract the tables generated by selecting "Branches", a city and a district from this site: https://www.acb.com.vn/wps/portal/en/atm
So far, I have been able to write the code to parse through each city and district:
from selenium.webdriver.support.ui import Select
from selenium.webdriver import Chrome
import pandas as pd
import time
webdriver = "chromedriver.exe"
driver = Chrome(webdriver)
driver.get('https://www.acb.com.vn/wps/portal/en/atm')
branch_selector = driver.find_element_by_xpath('//*[@id="branch"]')
branch_selector.click()
city = Select(driver.find_element_by_id('cityId'))
for i in range(len(city.options)):
    city.select_by_index(i)
    time.sleep(1)
    district = Select(driver.find_element_by_id('districtId'))
    for j in range(len(district.options)):
        district.select_by_index(j)
        time.sleep(1)
        try:
            find_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
            find_btn.click()
            time.sleep(1)
        except:
            close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
            close_btn.click()
            time.sleep(1)
Now I want to extract the table that is displayed in each iteration of the two loops. However, the HTML for the table (included in the EDIT at the end of this question) does not use the "table" tag.
So, how do I extract the table for each city-district pair?
I tried the following:
try:
    click_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
    click_btn.click()
    time.sleep(1)
    table = driver.find_elements_by_class_name('tbody')
    for table_row in table:
        row = table_row.find_elements_by_class_name('row')
        print ([r.text for r in row])
except:
    close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
    close_btn.click()
    time.sleep(1)
But it prints a list of blank strings for each city-district pair, with as many entries as there are addresses in the table for that pair:
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '', '']
['', '', '', '']
['', '']
['', '']
I also tried to access each element in each row of the table individually:
try:
    find_btn = driver.find_element_by_xpath('//*[@id="frm-filter"]/div[3]/a[1]')
    find_btn.click()
    time.sleep(1)
    table = driver.find_elements_by_class_name('tbody')
    for table_row in table:
        row = table_row.find_elements_by_class_name('row')
        for element in row:
            time.sleep(1)
            Type.append(element.find_element_by_class_name('col type'))
            Address.append(element.find_element_by_class_name('col address'))
            District.append(element.find_element_by_class_name('col district'))
            Tel_Fax.append(element.find_element_by_class_name('col tel-fax'))
            Hours.append(element.find_element_by_class_name('col hours'))
except:
    close_btn = driver.find_element_by_xpath('//*[@id="close-send-email"]/span[2]')
    close_btn.click()
    time.sleep(1)
But this gives the following error:
---------------------------------------------------------------------------
NoSuchElementException Traceback (most recent call last)
<ipython-input-41-2d73f0dc931c> in <module>
39
---> 40 Type.append(element.find_element_by_class_name('col type'))
41 Address.append(element.find_element_by_class_name('col address'))
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"[id="col type"]"}
Since the error mentions a css selector, I tried the following:
element.find_element_by_css_selector('div.col.type').text
This outputs a blank string, ''.
So, how do I do this?
EDIT: The HTML of the table, for one district-city selection, is:
<div class="tbody">
<div class="row" id="row1">
<div class="col stt">1</div>
<div class="col type">
PGD Hai Bà Trưng</div>
<div class="col address">56-58-60 Hai Bà Trưng, P. Bến Nghé, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 6291 3690<br>(028) 6291 3691</div>
<div class="col hours"> 07:00-16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('56-58-60 Hai Bà Trưng, P. Bến Nghé, Quan 1, Ho Chi Minh', '10.77714,106.704325', 1); return false;">Direction</a></div>
</div>
<div class="row" id="row2">
<div class="col stt">2</div>
<div class="col type">
PGD Đa Kao</div>
<div class="col address">45 Võ Thị Sáu, P. Đa Kao, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 6290 5980<br>(028) 6290 5981</div>
<div class="col hours"> 07:30 – 16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('45 Võ Thị Sáu, P. Đa Kao, Quan 1, Ho Chi Minh', '10.790715,106.69486', 2); return false;">Direction</a></div>
</div>
<div class="row" id="row3">
<div class="col stt">3</div>
<div class="col type">
PGD Nguyễn Công Trứ</div>
<div class="col address">74 - 76 Nguyễn Công Trứ, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3914 4470 <br>(028) 3914 4471</div>
<div class="col hours"> 07:30 – 16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('74 - 76 Nguyễn Công Trứ, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh', '10.76972,106.703142', 3); return false;">Direction</a></div>
</div>
<div class="row" id="row4">
<div class="col stt">4</div>
<div class="col type">
PGD Lê Lợi</div>
<div class="col address">72 Lê Lợi, P. Bến Thành, Quận 1, TP.Hồ Chí Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3821 4619<br>(028) 3821 4618</div>
<div class="col hours"> 07:00-16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('72 Lê Lợi, P. Bến Thành, Quận 1, TP.Hồ Chí Minh', '10.773541,106.699635', 4); return false;">Direction</a></div>
</div>
<div class="row" id="row5">
<div class="col stt">5</div>
<div class="col type">
CN Sài Gòn</div>
<div class="col address">41 Mạc Đỉnh Chi, P. Đakao, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3824 3770<br>(028) 3824 3946</div>
<div class="col hours"> 07:30 – 16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('41 Mạc Đỉnh Chi, P. Đakao, Quan 1, Ho Chi Minh', '10.786191,106.697818', 5); return false;">Direction</a></div>
</div>
<div class="row" id="row6">
<div class="col stt">6</div>
<div class="col type">
PGD Nguyễn Thái Bình</div>
<div class="col address">176 – 178 Ký Con, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3915 1310<br>(028) 3915 1311</div>
<div class="col hours"> 07:30 – 16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('176 – 178 Ký Con, P. Nguyễn Thái Bình, Quan 1, Ho Chi Minh', '10.768917,106.696863', 6); return false;">Direction</a></div>
</div>
<div class="row" id="row7">
<div class="col stt">7</div>
<div class="col type">
PGD Bến Chương Dương</div>
<div class="col address">328 Võ Văn Kiệt, phường Cô Giang, Quận 1, Tp.HCM</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3837 0586<br>(028) 3837 0584</div>
<div class="col hours"> 7h30-16h30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('328 Võ Văn Kiệt, phường Cô Giang, Quận 1, Tp.HCM', '10.76161,106.695998', 7); return false;">Direction</a></div>
</div>
<div class="row" id="row8">
<div class="col stt">8</div>
<div class="col type">
PGD Trần Khắc Chân</div>
<div class="col address">48-50 Nguyễn Hữu Cầu, P.Tân Định, Q.1, TP.HCM</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3820 9990<br>(028) 3526 7738</div>
<div class="col hours"> 07:30 -16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('48-50 Nguyễn Hữu Cầu, P.Tân Định, Q.1, TP.HCM', '10.790724, 106.690976', 8); return false;">Direction</a></div>
</div>
<div class="row" id="row9">
<div class="col stt">9</div>
<div class="col type">
PGD Cống Quỳnh</div>
<div class="col address">106 108 Cống Quỳnh, P. Nguyễn Cư Trinh, Q.1</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 38385464<br>(028) 3925 6645</div>
<div class="col hours"> 07:30 -16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('106 108 Cống Quỳnh, P. Nguyễn Cư Trinh, Q.1', '10.764772,106.687505', 9); return false;">Direction</a></div>
</div>
<div class="row" id="row10">
<div class="col stt">10</div>
<div class="col type">
CN Bến Thành</div>
<div class="col address">96 Lý Tự Trọng, P. Bến Thành, Quan 1, Ho Chi Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3825 7949<br>(028) 3825 7950</div>
<div class="col hours"> 07:30-16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('96 Lý Tự Trọng, P. Bến Thành, Quan 1, Ho Chi Minh', '10.774379, 106.697395', 10); return false;">Direction</a></div>
</div>
<div class="row" id="row11">
<div class="col stt">11</div>
<div class="col type">
PGD Tân Định </div>
<div class="col address">261 Trần Quang Khải, Phường Tân Định, Quận 1, TP.HCM</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 3848 0520<br></div>
<div class="col hours"> 07:30 - 16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('261 Trần Quang Khải, Phường Tân Định, Quận 1, TP.HCM', '10.791284, 106.688080', 11); return false;">Direction</a></div>
</div>
<div class="row" id="row12">
<div class="col stt">12</div>
<div class="col type">
PGD Nguyễn Du</div>
<div class="col address">Tầng hầm 1, tầng trệt, tầng lửng và tầng 2 tòa nhà 480 đường Nguyễn Thị Minh Khai, Phường 2, Quận 3, TP.Hồ Chí Minh</div>
<div class="col district">1</div>
<div class="col tel-fax">(028) 35218626<br>(028) 35218627</div>
<div class="col hours"> 07:30 -16:30</div>
<div class="col control"><a href="#" title="Direction" class="btn-direction" onclick="showDialogDirection('Tầng hầm 1, tầng trệt, tầng lửng và tầng 2 tòa nhà 480 đường Nguyễn Thị Minh Khai, Phường 2, Quận 3, TP.Hồ Chí Minh', '10.777328,106.698459', 12); return false;">Direction</a></div>
</div>
</div>
Upvotes: 0
Views: 534
Reputation: 17368
Analysing the website shows that it makes a POST request when the form is submitted. The function the site uses is:
function findMap() {
    var keyWord = document.getElementById("keyWord").value;
    var cityId = document.getElementById("cityId").value;
    var districtId = document.getElementById("districtId").value;
    var isCheckBranch = document.getElementById("branch").checked;
    var isCheckAtm = document.getElementById("atm").checked;
    var isCheckWestern = document.getElementById("western").checked;
    var isCheckCdm = document.getElementById("cdm").checked;
    var branch = "";
    var atm = "";
    var western = "";
    var cdm = "";
    var input = document.getElementById("keyWord");
    var placeholder = input.placeholder;
    if (keyWord == placeholder) {
        keyWord = "";
    }
    if ((!isCheckBranch) && (!isCheckAtm) && (!isCheckWestern) && (!isCheckCdm)) {
        showMessage('Please select Branch or ATM or Western Union or CDM.', 'branch');
        return;
    }
    if ((!districtId || 0 === districtId.length) && (!keyWord || 0 === keyWord)) {
        showMessage('Please select the province or enter the address.', 'keyWord');
        return;
    }
    if (isCheckBranch) {
        branch = "branch";
    }
    if (isCheckAtm) {
        atm = "atm";
    }
    if (isCheckWestern) {
        western = "western";
    }
    if (isCheckCdm) {
        cdm = "cdm";
    }
    var url = '/ACBMapPortlet/en/Process.jsp';
    var urlPattern = 'https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp';
    $( "#resultSearch" ).load( url, { "params[]": [ "Search", branch, atm, western, cdm, districtId, keyWord, cityId, latlng, urlPattern]} );
}
So, now you can understand what happens when you click the submit button.
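To make the mapping from form fields to the request explicit, here is a minimal Python sketch of the same payload. The helper name build_search_params and its keyword arguments are my own and not part of the site. latlng is a global in the page's script that is not shown in the function, so the default of 0 is an assumption.

```python
def build_search_params(district_id, city_id, keyword="", branch=True,
                        atm=False, western=False, cdm=False, latlng=0):
    """Build the positional "params[]" array in the order findMap() uses.

    Hypothetical helper: the name, the defaults and latlng=0 are assumptions.
    """
    url_pattern = "https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp"
    return {
        "params[]": [
            "Search",
            # Each checkbox contributes its own name when ticked, else "".
            "branch" if branch else "",
            "atm" if atm else "",
            "western" if western else "",
            "cdm" if cdm else "",
            district_id,
            keyword,
            city_id,
            latlng,
            url_pattern,
        ]
    }
```

Passing the returned dict as data= to requests.post should send one params[] field per list item, which is also how jQuery serializes an array under a key ending in [].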
The website sends these values as form data under the key params[], in the order shown in the function.
To scrape this in Python, send the same POST request directly.
Here cityId = 18 and districtId = 187 (the district list is populated through an ajax call).
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Positional values, in the order findMap() sends them:
# action, branch, atm, western, cdm, districtId, keyWord, cityId, latlng, urlPattern
params = ["Search", "branch", "atm", "western", "cdm", 187, "", 18, 0,
          "https://www.acb.com.vn:443/ACBMapPortlet/en/MapMobi.jsp"]
res = requests.post("https://www.acb.com.vn/ACBMapPortlet/en/Process.jsp", data={"params[]": params})
# Strip line breaks and tabs so the cell text comes out clean
result = res.text.replace("\n", "").replace("\t", "").replace("\r", "")
soup = BeautifulSoup(result, "lxml")
# [:-1] drops the last column (the "Direction" link in each body row)
headers = [i.text.strip() for i in soup.find("div", class_="thead").find_all("div", class_="col")[:-1]]
body = [[j.text.strip() for j in i.find_all("div", class_="col")[:-1]] for i in soup.find("div", class_="tbody").find_all("div", class_="row")]
df = pd.DataFrame(body, columns=headers)
print(df)
df.to_csv("data.csv", index=False)
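One thing to watch when taking .text from these cells: the tel-fax column separates its numbers with a br tag, and that separator disappears when the text is flattened, so the two numbers run together. As a dependency-free illustration, the sketch below swaps BeautifulSoup for the standard library's html.parser (my substitution, not what the code above uses). It walks the div.row / div.col structure from the question's HTML and joins br-separated pieces with " / ". The names RowParser and parse_rows and the separator are arbitrary choices.

```python
from html.parser import HTMLParser

BR = "\x00"  # private marker standing in for <br> while a cell is being read


class RowParser(HTMLParser):
    """Collect the text of every div.col found directly inside a div.row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._depth = 0    # div nesting depth counted from the row (0 = outside)
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._depth == 0:
                if "row" in classes:
                    self.rows.append([])
                    self._depth = 1
            else:
                self._depth += 1
                if self._depth == 2 and "col" in classes:
                    self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(BR)

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        if self._depth == 2 and self._cell is not None:
            # Normalise whitespace in each <br>-separated piece, drop empty ones
            parts = (" ".join(p.split()) for p in "".join(self._cell).split(BR))
            self.rows[-1].append(" / ".join(p for p in parts if p))
            self._cell = None
        self._depth -= 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def parse_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = RowParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling parse_rows(res.text) would return every div.row in the response, so a header row that uses the same class names would be included as well. Unlike the [:-1] slice above, this sketch keeps the last "Direction" column.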
Update 1:
City id: the city ids are hard-coded in the website, in the value attributes of the options in the city select tag.
District id: the website makes an ajax call to get these.
function getDistrict(cityId) {
    var url = '/ACBMapPortlet/en/DistrictSelectBox.jsp';
    $.post( url, { cmd:'DISTRICT', cityId:cityId}, function(data) {
        var content = $( data );
        $("#divDistrict span").empty().append("District");
        $("#iconselect").empty();
        $("#districtId").empty().append(content);
    });
}
How to get all districts given a cityId?
def districts_names(cityid):
    # Same form fields as the site's getDistrict() ajax call
    data = {"cmd": "DISTRICT", "cityId": cityid}
    res = requests.post("https://www.acb.com.vn/ACBMapPortlet/en/DistrictSelectBox.jsp", data=data)
    soup = BeautifulSoup(res.text, "lxml")
    return [(i["value"].strip(), i.text.strip()) for i in soup.find_all("option")]
Example:
districts_names(3)
will give the following
[('', 'District'),
('234', 'Ba Bể'),
('235', 'Bạch Thông'),
('236', 'Chợ Đồn'),
('237', 'Chợ Mới'),
('238', 'Na Rì'),
('239', 'Ngân Sơn'),
('240', 'Bắc Kạn')]
Each tuple has the form (district_id, district_name).
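Putting the two requests together, a loop over the districts of a city can drive one search per district. The sketch below keeps the network calls out of the loop by taking them as arguments. collect_city, fetch_districts and fetch_rows are hypothetical names: fetch_districts is expected to behave like districts_names above, and fetch_rows to return the parsed rows for one city-district pair.

```python
def collect_city(city_id, fetch_districts, fetch_rows):
    """Gather the rows of every district of one city.

    fetch_districts(city_id) -> [(district_id, district_name), ...]
    fetch_rows(city_id, district_id) -> list of rows (each a list of cells)
    Both callables are assumptions standing in for the real requests.
    """
    records = []
    for district_id, district_name in fetch_districts(city_id):
        if not district_id:
            # Skip the ('', 'District') placeholder option
            continue
        for row in fetch_rows(city_id, district_id):
            # Prefix each row with the city id and district name
            records.append([city_id, district_name, *row])
    return records
```

For example, collect_city(3, districts_names, my_fetch_rows) would walk the seven districts listed above, where my_fetch_rows is whatever function you write around the Process.jsp request. The result is a plain list of lists, so it can be passed to pd.DataFrame once you add column names.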
Upvotes: 1