Abadian

Reputation: 1

Scraping a table with too many rows

I want to use Python to get all the tables on the website 'https://www.tgju.org/archive/price_dollar_rl'

and this is my code:

import requests
import pandas as pd

url = 'https://www.tgju.org/archive/price_dollar_rl'
html = requests.get(url).content

df_list = pd.read_html(html)  # one DataFrame per <table> found in the HTML
df = df_list[-1]              # this keeps only the last table

print(df)
df.to_csv('my data.csv')

But only one of the 95 tables is saved. What should I do to save all of them?

Upvotes: 0

Views: 128

Answers (2)

Andrej Kesely

Reputation: 195553

To get all pages, you can simulate the site's Ajax request and load the data directly from its API:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# The per-column DataTables parameters are identical for all eight columns,
# so build them in a loop instead of listing 48 keys by hand:
query = {
    "lang": "fa",
    "order_dir": ["asc", ""],
    "draw": "9",
    "start": "0",
    "length": "30",
    "search": "",
    "order_col": "",
    "from": "",
    "to": "",
    "convert_to_ad": "1",
    # "_": "1624699477042"
}
for i in range(8):
    query.update({
        f"columns[{i}][data]": str(i),
        f"columns[{i}][name]": "",
        f"columns[{i}][searchable]": "true",
        f"columns[{i}][orderable]": "true",
        f"columns[{i}][search][value]": "",
        f"columns[{i}][search][regex]": "false",
    })

url = "https://api.accessban.com/v1/market/indicator/summary-table-data/price_dollar_rl"

out = []
for start in range(0, 10):  # <-- increase number of pages here
    print("Getting page {}...".format(start))
    query["start"] = start * 30

    data = requests.get(url, params=query).json()
    out.extend(data["data"])

df = pd.DataFrame(out)
# columns 4 and 5 contain HTML markup; strip it down to plain text
df[4] = df[4].apply(lambda x: BeautifulSoup(x, "html.parser").text)
df[5] = df[5].apply(lambda x: BeautifulSoup(x, "html.parser").text)

print(df)
df.to_csv("data.csv", index=False)

Prints:

           0        1        2        3      4       5           6           7
0    241,690  241,190  242,440  241,890    100   0.04%  2021/06/24   1400/04/3
1    243,310  240,790  243,340  241,790    880   0.36%  2021/06/23   1400/04/2
2    241,190  241,190  243,140  242,670   1680    0.7%  2021/06/22   1400/04/1
3    239,940  239,190  241,440  240,990   1390   0.58%  2021/06/21  1400/03/31
4    234,810  234,690  240,440  239,600   4710   2.01%  2021/06/20  1400/03/30
5    244,490  234,690  244,640  234,890   9400      4%  2021/06/19  1400/03/29
6    242,010  241,950  244,640  244,290   2540   1.05%  2021/06/17  1400/03/27
7    240,470  239,450  242,250  241,750   1260   0.52%  2021/06/16  1400/03/26
8    239,970  239,950  240,050  240,490    770   0.32%  2021/06/15  1400/03/25
9    240,970  238,550  241,050  239,720   1310   0.55%  2021/06/14  1400/03/24
10   238,970  238,940  241,250  241,030   3280   1.38%  2021/06/13  1400/03/23
11   236,830  236,140  238,350  237,750   1480   0.62%  2021/06/12  1400/03/22
12   240,010  239,140  240,450  239,230   2210   0.92%  2021/06/10  1400/03/20

...

And saves data.csv (screenshot from LibreOffice not shown).
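If the total number of pages is unknown, the fixed `range(0, 10)` loop above can be replaced by an open-ended loop that stops when the API returns a short (or empty) page. A minimal, self-contained sketch of that stopping logic, with a hypothetical `fetch_page` standing in for the `requests.get(url, params=query).json()` call:

```python
# Hypothetical stand-in for requests.get(url, params=query).json();
# it serves 70 canned rows in pages of 30 so the loop runs offline.
def fetch_page(start, page_size=30, total=70):
    rows = list(range(start, min(start + page_size, total)))
    return {"data": rows}

out = []
start = 0
while True:
    page = fetch_page(start)["data"]
    out.extend(page)
    if len(page) < 30:  # a short or empty page means no more data
        break
    start += 30

print(len(out))  # 70
```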

Upvotes: 1

T. Kelher

Reputation: 1186

Well, first of all, there is a difference between the URL in the question text and the URL in the code.

Second, the site uses pagination, so you would need something like Selenium to press the site's "next" button from a script, fetch the HTML after each click, and then convert the collected tables to CSV.
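Whichever tool drives the pagination, each page's HTML can be fed to pd.read_html and the resulting frames concatenated. A sketch with two inline HTML snippets standing in for the pages Selenium would fetch:

```python
from io import StringIO

import pandas as pd

# Hypothetical page sources; in practice these would come from
# driver.page_source after each click on the "next" button.
pages = [
    "<table><tr><th>price</th></tr><tr><td>100</td></tr></table>",
    "<table><tr><th>price</th></tr><tr><td>200</td></tr></table>",
]

# pd.read_html returns one DataFrame per <table>; take the first per page
frames = [pd.read_html(StringIO(html))[0] for html in pages]
df = pd.concat(frames, ignore_index=True)
df.to_csv("all_pages.csv", index=False)
print(df)
```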

Upvotes: 0
