Reputation: 157
I am using the html link below to read the table in the page:
http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664
The last part of the link(allbin)
is an ID. This ID changes and by using different IDs you can access different tables and records. The link although remains the same, just the ID in the end changes every time. I have like 1000 more different IDs like this. So, How can I actually use those different IDs to access different tables and join them together?
Output Like this,
ID Number Type FileDate
2016664 NB 14581-26 New Building 12/21/2020
4257909 NB 1481-29 New Building 3/6/2021
4138920 NB 481-29 New Building 9/4/2020
List of other ID for use:
['4257909', '4138920', '4533715']
This was my attempt, I can read a single table with this code.
import requests
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
html = requests.get(url).content
df_list = pd.read_html(html,header=0)
df = df_list[3]
df
Upvotes: 1
Views: 549
Reputation: 16
The code below will extract all the tables in a web page
import numpy as np
import pandas as pd
url = 'http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin=2016664'
df_list = pd.read_html(url) #returns as list of dataframes from the web page
print(len(df_list)) #print the number of dataframes
i = 0
while i < len(df_list): #loop through the list to print all tables
df = df_list[i]
print(df)
i = i + 1
Upvotes: 0
Reputation: 195543
To get all pages from list of IDs you can use next example:
import requests
import pandas as pd
from io import StringIO
url = "http://a810-bisweb.nyc.gov/bisweb/ActionsByLocationServlet?requestid=1&allbin={}&allcount={}"
def get_info(ID, page=1):
out = []
while True:
try:
print("ID: {} Page: {}".format(ID, page))
t = requests.get(url.format(ID, page), timeout=1).text
df = pd.read_html(StringIO(t))[3].loc[1:, :]
if len(df) == 0:
break
df.columns = ["NUMBER", "NUMBER", "TYPE", "FILE DATE"]
df["ID"] = ID
out.append(df)
page += 25
except requests.exceptions.ReadTimeout:
print("Timeout...")
continue
return out
list_of_ids = [2016664, 4257909, 4138920, 4533715]
dfs = []
for ID in list_of_ids:
dfs.extend(get_info(ID))
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=None)
Prints:
NUMBER NUMBER TYPE FILE DATE ID
1 ALT 1469-1890 NaN ALTERATION 00/00/0000 2016664
2 ALT 1313-1874 NaN ALTERATION 00/00/0000 2016664
3 BN 332-1938 NaN BUILDING NOTICE 00/00/0000 2016664
4 BN 636-1916 NaN BUILDING NOTICE 00/00/0000 2016664
5 CO NB 1295-1923 (PDF) CERTIFICATE OF OCCUPANCY 00/00/0000 2016664
...
And saves data.csv
(screenshot from LibreOffice):
Upvotes: 1