Reputation: 113
I am trying to scrape the main table, which has this tag:
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from the following website using the BeautifulSoup library, but the code prints an empty list [], even though printing soup shows an HTML string and the request status is 200. With the browser's "Inspect element" tool I can see the table tag, but in "View page source" the table, which sits inside the app-root tag, is missing (you only see an empty <app-root></app-root>).
There is also no JSON file among the page's resources to extract the data from. How can I scrape the table data?
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html")
table_body=soup.find_all("table")
print(table_body)
Upvotes: 2
Views: 1235
Reputation: 20042
The table is in the source HTML, but it's kinda hidden in one of the <script> tags and then rendered by JavaScript. The script can be located with bs4 and then parsed with a regex. Finally, the table data can be loaded with json.loads, put into a pandas DataFrame, and written to a .csv file. Since I don't know any Persian, you'd have to see if it's of any use, but just by looking at some values, I think it is.
Oh, and this can be done without selenium.
Here's how:
import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"

# Note: verify=False skips TLS certificate verification for this request.
scripts = BeautifulSoup(
    requests.get(url, verify=False).content,
    "lxml",
).find_all("script", {"type": "text/javascript"})

# The table data is a JavaScript object assigned to `datasource`.
# The index -5 depends on the order of the page's <script> tags.
table_data = json.loads(
    re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)

# One row per cell of the first table on the first sheet.
pd.DataFrame(
    table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge file, huge_table.csv.
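The hard-coded scripts[-5] index in the code above is tied to the page's current script order. Here is a minimal sketch of a less position-dependent variant, using only the standard library: it searches the whole HTML for the `var datasource =` assignment and lets `json.JSONDecoder.raw_decode` read exactly one JSON value from that point. `SAMPLE_HTML`, the cell keys `address` and `value`, and the helper name `extract_datasource` are made up for illustration. Only the `sheets` / `tables` / `cells` nesting comes from the code above, and the sketch assumes the assigned object is valid JSON.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real page is much larger
# and its cell fields may be named differently.
SAMPLE_HTML = """
<html><body><app-root></app-root>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">
var datasource = {"sheets": [{"tables": [{"cells": [
    {"address": "A1", "value": "x"},
    {"address": "B1", "value": "42"}
]}]}]};
var after = {"unrelated": true};
</script>
</body></html>
"""


def extract_datasource(html: str) -> dict:
    """Return the object assigned to `var datasource` in the given HTML."""
    # The trailing \s* makes match.end() point at the opening brace.
    match = re.search(r"var\s+datasource\s*=\s*", html)
    if match is None:
        raise ValueError("no 'var datasource' assignment found")
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever follows it (here: the ';' and later statements).
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


cells = extract_datasource(SAMPLE_HTML)["sheets"][0]["tables"][0]["cells"]
print(len(cells))
```

With the real page, the text returned by requests.get(url).text would take the place of SAMPLE_HTML. Because raw_decode stops at the end of the first JSON value, a later `}` in the same script does not get swallowed the way it can with a greedy `{.*}` pattern.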
Upvotes: 3
Reputation: 3541
It might not be the best solution, but with a webdriver in headless mode you can get everything you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()
Upvotes: 0
Reputation: 649
It looks like the elements you're trying to get are rendered by JavaScript. You will need to use something like Selenium to get the fully rendered HTML.
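A quick way to confirm that diagnosis without a browser is to count the tags in the raw response. The sketch below uses only the standard library. `STATIC_HTML` is a hypothetical stand-in for what the server sends before any JavaScript runs, and `TagCounter` is an illustrative helper, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical static response: an empty <app-root> plus a script,
# and no <table> anywhere in the markup.
STATIC_HTML = (
    "<html><body><app-root></app-root>"
    "<script>var datasource = {};</script></body></html>"
)


class TagCounter(HTMLParser):
    """Count how often each start tag appears in the fed HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(STATIC_HTML)
counter.close()

table_count = counter.counts.get("table", 0)
has_app_root = "app-root" in counter.counts
print(table_count, has_app_root)
```

If the raw response contains an app-root tag but no table tag, the table is being built client-side, so find_all("table") on that response has nothing to return.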
Upvotes: -1