Tahsin Alam
Tahsin Alam

Reputation: 157

Read tables from HTML page by changing the ID using python

I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664

The last part of the link(allbin) is an ID. This ID changes and by using different IDs you can access different tables and records. The link although remains the same, just the ID in the end changes every time. I have like 1000 more different IDs like this. So, How can I actually use those different IDs to access different tables and join them together?

Output Like this,

ID         Number         Type             FileDate
2016664   NB 14581-26     New Building    12/21/2020
4257909   NB 1481-29      New Building    3/6/2021
4138920   NB 481-29       New Building    9/4/2020

List of other ID for use:

['4257909', '4138920', '4533715']

This was my attempt, I can read a single table with this code.

import requests
import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
    
df 

Upvotes: 1

Views: 549

Answers (2)

Henry Eleonu
Henry Eleonu

Reputation: 16

The code below will extract all the tables in a web page

import numpy as np

import pandas as pd

url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'

df_list = pd.read_html(url) #returns as list of dataframes from the web page

print(len(df_list)) #print the number of dataframes

i = 0

while i < len(df_list): #loop through the list to print all tables

df = df_list[i]

print(df)

i = i + 1

Upvotes: 0

Andrej Kesely
Andrej Kesely

Reputation: 195543

To get all pages from list of IDs you can use next example:

import requests
import pandas as pd
from io import StringIO

url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"


def get_info(ID, page=1):
    out = []
    while True:
        try:
            print("ID: {} Page: {}".format(ID, page))
            t = requests.get(url.format(ID, page), timeout=1).text
            df = pd.read_html(StringIO(t))[3].loc[1:, :]
            if len(df) == 0:
                break
            df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
            df["ID"] = ID
            out.append(df)
            page += 25
        except requests.exceptions.ReadTimeout:
            print("Timeout...")
            continue
    return out


list_of_ids = [2016664, 4257909, 4138920, 4533715]

dfs = []
for ID in list_of_ids:
    dfs.extend(get_info(ID))

df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)

Prints:

                                                                              NUMBER                                                                            NUMBER                                                                              TYPE                                                                         FILE DATE       ID
1                                                                      ALT 1469-1890                                                                               NaN                                                                        ALTERATION                                                                        00/00/0000  2016664
2                                                                      ALT 1313-1874                                                                               NaN                                                                        ALTERATION                                                                        00/00/0000  2016664
3                                                                        BN 332-1938                                                                               NaN                                                                   BUILDING NOTICE                                                                        00/00/0000  2016664
4                                                                        BN 636-1916                                                                               NaN                                                                   BUILDING NOTICE                                                                        00/00/0000  2016664
5                                                                    CO NB 1295-1923                                                                             (PDF)                                                          CERTIFICATE OF OCCUPANCY                                                                        00/00/0000  2016664

...

And saves data.csv (screenshot from LibreOffice):

enter image description here

Upvotes: 1

Related Questions