Newbier_RP

Reputation: 11

How to scrape an ASP.NET (.aspx) page

I'm new to Python. I've been attempting to download the table from this website: https://tradereport.moc.go.th/Report/ReportEng.aspx?Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1 However, it appears complicated because the report only appears after a button is clicked, and the resulting table is not present in the page source.

Firstly, before tackling that complicated table structure, I made it simpler by testing only how to get the result table after clicking the "ReviewReport" button. The table shows on the webpage, but there is nothing I can scrape out of the page source. Please help me get the table data.

Thank you very much.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time
import pandas as pd

Year = '2021'
Month = '11'
HScode = '721990'

options = Options()
#options.headless = True
#driver = webdriver.Chrome(r'C:\Python\Scripts\chromedriver.exe', options=options)
driver = webdriver.Firefox(executable_path=r'C:\Python\geckodriver.exe')
driver.get("https://tradereport.moc.go.th/Report/ReportEng.aspx?Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1")
time.sleep(5)
driver.get("https://tradereport.moc.go.th/Report/ReportEng.aspx?Report=HarmonizeCommodity&Lang=Eng&ImExType=1&Option=1")  # re-enter the url to get the table page
time.sleep(5)


driver.find_element_by_id("ddlYear").send_keys(Year)  #Working fine
driver.find_element_by_id("ddlMonth").send_keys(Month)  #Working fine
driver.find_element_by_id("txtHsCode").send_keys(HScode) #Working fine
submitbttn = driver.find_element_by_id("btnSubmit")       #Working fine
submitbttn.click()

time.sleep(5)

# dump the rendered page source to a file for inspection
f = open("d:\\page.txt", "w")
f.write(driver.page_source)
f.close()

print("************* Scraping data done **********************")
driver.quit()

Upvotes: 1

Views: 53

Answers (1)

sound wave

Reputation: 3547

By inspecting the HTML I noticed that the table is contained in an iframe, so the first thing to do is switch to it so that Selenium can find the elements inside it:

iframe_id = 'ASPxDocumentViewer1_Splitter_Viewer_ContentFrame'
driver.switch_to.frame(iframe_id)
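
If the iframe has not finished loading when switch_to.frame runs, Selenium raises NoSuchFrameException. As a minimal sketch (reusing the WebDriverWait and expected_conditions imports already present in the question), you can wait for the frame explicitly instead of relying on a fixed sleep:

# wait up to 15 seconds for the report iframe to be present, then switch into it;
# frame_to_be_available_and_switch_to_it accepts an id/name string or a locator tuple
WebDriverWait(driver, 15).until(
    EC.frame_to_be_available_and_switch_to_it(iframe_id)
)

Once you are done scraping inside the frame, driver.switch_to.default_content() returns to the main page.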

Then we scrape the header of the table, which is composed of two rows (I called them header1 and header2). For simplicity we squeeze them into a single row, called header in the code. This is what it looks like:

['No',
 'Country',
 'Quanity (Dec.2022)',
 'Value (Dec.2022)',
 'Share (Dec.2022)',
 'Quanity (Jan. - Dec.  2022)',
 'Value (Jan. - Dec.  2022)',
 'Share (Jan. - Dec.  2022)']

Then we can start scraping the values of the table. You can do it in two ways: by rows or by columns. In our case (in fact, in almost all cases) there are more rows than columns (28 vs. 8), so it is faster to go by columns. At the end of the loop the variable columns is a list of 8 lists, each containing 28 elements. Using the header entries as keys and the columns as values, we build a dictionary, pass it to pd.DataFrame to create the table, and save the result to a CSV named tradereport_data.csv.

header1 = [td.text for td in driver.find_elements(By.XPATH, "//div[@id='report_div']//tr[6]/td[@class]")]
header2 = [td.text for td in driver.find_elements(By.XPATH, "//div[@id='report_div']//tr[7]/td[@class]")]
# merge the two header rows: the first two cells (No, Country) stand alone,
# the remaining ones get the period from header1 appended in parentheses
header = header1[:2] + [f'{h} ({header1[2]})' for h in header2[2:5]] + [f'{h} ({header1[3]})' for h in header2[5:8]]

columns_number = 8
columns = []
for p in range(1, columns_number + 1):
    # grab the p-th cell of every data row, skipping the 7 header rows and the 2 footer rows
    columns.append([x.text.replace('\n', '').strip() for x in driver.find_elements(
        By.XPATH, f"//div[@id='report_div']//tr[(position()>7) and (position()<last()-1)]/td[@class][{p}]")])

df = pd.DataFrame(dict(zip(header, columns)))
df.to_csv('tradereport_data.csv', index=False)

and this is what df looks like:

[screenshot of the resulting DataFrame]

As a final note, the XPath predicate tr[(position()>7) and (position()<last()-1)] selects all the tr elements except the first seven and the last two.
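
As an aside (a sketch, not part of the original answer), the same skip can be done on the Python side by fetching whole rows once and slicing, assuming each data row exposes its 8 cells as td[@class]:

# alternative sketch: grab every row, then slice off the 7 header rows
# and 2 footer rows in Python rather than in the XPath predicate
rows = driver.find_elements(By.XPATH, "//div[@id='report_div']//tr")
table = [
    [td.text.replace('\n', '').strip() for td in row.find_elements(By.XPATH, "./td[@class]")]
    for row in rows[7:-2]
]
df_alt = pd.DataFrame(table, columns=header)

This issues one find_elements call per row (28 here) versus one per column (8), which is why the column-wise loop above is the faster of the two on this table.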

Upvotes: 0
