yasharov

Reputation: 113

scraping table from a website result as empty

I am trying to scrape the main table with the tag:

<table _ngcontent-jna-c4="" class="rayanDynamicStatement">

from the following website using the BeautifulSoup library, but the code returns an empty list [], even though printing soup shows the HTML string and the request status is 200. I found out that when I use the browser's "inspect element" tool I can see the table tag, but in "view page source" the table tag, which is part of the app-root tag, is not shown (you see <app-root></app-root>, which is empty). There is also no JSON file among the web page's resources to extract the data from. Please help me: how can I scrape the table data?

import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup

yurl = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req = urllib.request.urlopen(yurl)
print(req.status)
# get response
response = req.read()
html = response.decode("utf-8")
# make html readable
soup = BeautifulSoup(html, features="html.parser")
table_body = soup.find_all("table")
print(table_body)  # prints []
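The empty result is reproducible offline: the markup that "view page source" shows has an empty <app-root>, so the document BeautifulSoup receives contains no <table> at all. A minimal sketch using only the standard library (the HTML string is a made-up stand-in for the real page source):

```python
from html.parser import HTMLParser

# What "view page source" shows: the Angular shell before any
# JavaScript has run -- <app-root> is still empty.
raw_html = "<html><body><app-root></app-root></body></html>"

class TableCounter(HTMLParser):
    """Count <table> start tags in the raw markup."""
    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1

parser = TableCounter()
parser.feed(raw_html)
print(parser.tables)  # 0 -- the table only exists after JS renders it
```

Since the table is injected by JavaScript after page load, no HTML parser alone will find it in the raw response; the data has to come from a rendered DOM or from wherever the script gets it.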

Upvotes: 2

Views: 1235

Answers (3)

baduker

Reputation: 20042

The table is in the source HTML, but it's hidden inside one of the <script> tags and only rendered into the DOM by JavaScript. The right script can be located with bs4 and the table data extracted with a regex, parsed with json.loads, loaded into a pandas DataFrame, and finally written to a .csv file. Since I don't know any Persian, you'd have to check whether the output is of any use, but judging by some of the values, I think it is.

Oh, and this can be done without Selenium.

Here's how:

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"

# Grab all inline scripts; the table data lives in one of them.
scripts = BeautifulSoup(
    requests.get(url, verify=False).content,
    "lxml",
).find_all("script", {"type": "text/javascript"})

# The fifth script from the end holds `var datasource = {...}`;
# capture the object literal and parse it as JSON.
table_data = json.loads(
    re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)

# Flatten the cell list into a DataFrame and dump it to a .csv file.
pd.DataFrame(
    table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)

This outputs a huge file that looks like this:

[screenshot of the resulting .csv]
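The regex-plus-json.loads extraction step can be sanity-checked offline against a hard-coded script body (the JSON payload below is a tiny invented stand-in; the real datasource on codal.ir is far larger):

```python
import json
import re

# Invented stand-in for the inline <script> text that carries the table.
script_text = (
    'var config = {};\n'
    'var datasource = {"sheets": [{"tables": [{"cells": '
    '[{"address": "A1", "value": "x"}]}]}]};\n'
    'var other = 1;'
)

# Same pattern as in the answer: greedily capture the object literal
# assigned to `datasource` (it sits on a single line in the page).
match = re.search(r"var datasource = ({.*})", script_text)
table_data = json.loads(match.group(1))

cells = table_data["sheets"][0]["tables"][0]["cells"]
print(cells)  # [{'address': 'A1', 'value': 'x'}]
```

The greedy `{.*}` works because the whole object sits on one line and `.` does not cross newlines without re.DOTALL; it captures up to the last closing brace on that line, leaving the trailing semicolon out of the JSON.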

Upvotes: 3

Vova

Reputation: 3541

Might not be the best solution, but with webdriver in headless mode you can get everything you want:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
# page_source now contains the JavaScript-rendered DOM,
# so the table is actually present.
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()

Upvotes: 0

Samy

Reputation: 649

It looks like the elements you're trying to get are rendered by some JavaScript code. You will need to use something like Selenium instead in order to get the fully rendered HTML.

Upvotes: -1
