Web scraping with BS4: unable to get table

Question

When you open the URL below in a browser,

http://www.kianfunds2.com/%D8%A7%D8%B1%D8%B2%D8%B4-%D8%AF%D8%A7%D8%B1%D8%A7%DB%8C%DB%8C-%D9%87%D8%A7-%D9%88-%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF-%D9%88%D8%A7%D8%AD%D8%AF-%D9%87%D8%A7

you see a purple button labeled "copy". Clicking it copies the complete table, which you can then paste into Excel. How can I get this table as input in Python?

My code is below, and it shows nothing:

import requests
from bs4 import BeautifulSoup
url = "http://www.kianfunds2.com/" + "ارزش-دارایی-ها-و-تعداد-واحد-ها"
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")
table = soup.find("a", class_="dt-button buttons-copy buttons-html5")

I don't want to use Selenium because it takes a lot of time. Please use Beautiful Soup.

Lennart Thamm · Accepted Answer

To me, web scraping seems unnecessary here. Since you can download the data as a file anyway, there is no need to go through the parsing that scraping would require.

Instead, you could just download the data and read it into a pandas DataFrame. You will need to have pandas installed. If you have Anaconda installed, you might already have pandas on your computer; otherwise you might need to download Anaconda and install pandas: conda install pandas

More Information on Installing Pandas

With pandas you can read the data directly from the Excel sheet:

import pandas as pd
df = pd.read_excel("dataset.xlsx")

pandas.read_excel documentation

If that causes difficulties, you can still convert the Excel sheet to a CSV file and use pd.read_csv. Note that you will need to pass the correct encoding.
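As a rough sketch of the encoding point, the snippet below reads CSV bytes that start with a byte order mark, which is what Excel's "CSV UTF-8" export typically writes. The sample content and column names are made up for illustration; the real file's columns and encoding are unknown here.

```python
import io

import pandas as pd

# Hypothetical stand-in for the exported file; the real columns are not known.
sample = "تاریخ,NAV\n1398/01/01,1000\n"

# Encode with a leading BOM, as Excel's "CSV UTF-8" export typically does.
raw = sample.encode("utf-8-sig")

# "utf-8-sig" strips the BOM so it does not end up in the first column name.
df = pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig")
print(df.columns.tolist())
```

For a file on disk, the same call would take the path instead of the in-memory buffer, with whatever encoding the export actually used.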

If you want to use BeautifulSoup for some reason, you might want to look into how to parse tables. For a normal table, you would first identify the content you want to scrape. The table on that specific website has the id "arzeshdarayi". It is also the only table on that page, so you can also use the <table> tag to select it.

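# Select by tag and id
table = soup.find("table", id="arzeshdarayi")
# Or with a CSS selector (select_one returns the first match; select returns a list)
table = soup.select_one("#arzeshdarayi")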
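To illustrate parsing a normal table, here is a minimal sketch that runs against an inline HTML string rather than the live page. Only the id arzeshdarayi comes from the site; the column names and cell values are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical static table; only the id matches the real page.
html = """
<table id="arzeshdarayi">
  <thead><tr><th>Date</th><th>NAV</th></tr></thead>
  <tbody><tr><td>1398/01/01</td><td>1000</td></tr></tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="arzeshdarayi")

# Header cells live in <th>, data cells in <td>.
headers = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # skip the header row, which has no <td> cells
        rows.append(cells)

print(headers)
print(rows)
```

If the HTML returned by the server contains no data rows, rows stays empty, which is a quick way to check whether the data is present in the static markup at all.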

The table on the website you provided has only a static header; the data is rendered by JavaScript, so BeautifulSoup won't be able to retrieve it. However, you can use the JSON object that the JavaScript works with and, once again, read it in as a DataFrame:

import requests
import pandas as pd

r = requests.get("http://www.kianfunds2.com/json/gettables.ashx?get=arzeshdarayi")
data = r.json()  # decode the JSON response body
df = pd.DataFrame.from_dict(data)
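The exact shape of the JSON is not shown here, so it may be worth inspecting it before building the DataFrame. The sketch below uses a hypothetical helper, frame_from_payload, and assumes two common layouts: a bare list of rows, or a wrapper object with the rows under a "data" key. Neither layout is confirmed for this endpoint.

```python
import pandas as pd


def frame_from_payload(payload):
    """Build a DataFrame from decoded JSON (hypothetical helper)."""
    # Case 1 (assumed): the payload is already a list of rows.
    if isinstance(payload, list):
        return pd.DataFrame(payload)
    # Case 2 (assumed): the rows sit under a "data" key.
    if isinstance(payload, dict) and isinstance(payload.get("data"), list):
        return pd.DataFrame(payload["data"])
    # Fallback: treat the payload as a column-oriented dict of lists.
    return pd.DataFrame.from_dict(payload)


# Made-up payloads standing in for r.json().
records = frame_from_payload([{"date": "1398/01/01", "nav": 1000}])
wrapped = frame_from_payload({"data": [[1, 2], [3, 4]]})
print(records.shape, wrapped.shape)
```

In practice you would pass r.json() to the helper and adjust the cases once you have seen the real response.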

If you really want to scrape it, you will need some sort of browser simulation so that the JavaScript is evaluated before you access the HTML. This answer recommends using Requests-HTML, a very high-level approach to web scraping that brings together Requests and BeautifulSoup and also renders JavaScript. Your code would look somewhat like this:

import requests_html as request

session = request.HTMLSession()
url = "http://www.kianfunds2.com/ارزش-دارایی-ها-و-تعداد-واحد-ها"
r = session.get(url)

# Render the page, including JavaScript
# Uses Chromium (will be downloaded on first execution)
r.html.render(sleep=1)

# Find the table by its id and take only the first result
table = r.html.find("#arzeshdarayi")[0]

# Loop through the table rows
for items in table.find("tr"):
    # Extract the text of the header and data cells in this row,
    # dropping the last cell
    data = [item.text for item in items.find("th,td")[:-1]]
    print(data)
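Since the goal is to get the table as input in Python, the lists printed by the loop above could be collected and turned into a DataFrame. This is only a sketch: the rows below are made-up placeholders in the same shape (header cells first, then data cells), not values from the site.

```python
import pandas as pd

# Hypothetical rows in the shape the loop produces: header first, then data.
rows = [
    ["Date", "NAV"],
    ["1398/01/01", "1000"],
    ["1398/01/02", "1010"],
]

# Split off the header row and use it for the column names.
header, *body = rows
df = pd.DataFrame(body, columns=header)

# Scraped cells are strings, so convert numeric columns explicitly.
df["NAV"] = pd.to_numeric(df["NAV"])
print(df)
```

In the scraping loop, you would append each data list to rows instead of printing it.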
