Reputation: 1
I'm trying to scrape all the pages of this website: https://www.edison.k12.nj.us/directory?const_page=1&. I thought I could go to the next page by replacing the 1 with 2, 3, 4, and so on. However, that did not work: when I checked the href attributes of the tags, they don't seem to link to a new page. How can I scrape multiple pages in this case? Thank you so much!
page = 1
df_list = []
df = None
while(page < 240):
    url = 'https://www.edison.k12.nj.us/directory?const_page=' + str(page) + '&'
    # gets back the BeautifulSoup object
    bs = create_beautiful(url)
    # calls extract_data to get the necessary data
    df2 = extract_data(bs)
    if page == 1:
        df = df2
    else:
        df_list.append(df2)
    page += 1

count = 1
for df2 in df_list:
    df.append(df2, ignore_index=True)
    count += 1

to_csv_and_excel(df, 'edison_township_public')
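As a side note on combining the per-page results: DataFrame.append returns a new frame rather than modifying df in place, so its result must be assigned (and pandas has since deprecated it in favor of pd.concat). A minimal sketch of the concat pattern, using small hypothetical frames in place of the extract_data() results:

```python
import pandas as pd

# Hypothetical per-page frames standing in for extract_data() output.
pages = [
    pd.DataFrame({'name': ['Donna Abatemarco']}),
    pd.DataFrame({'name': ['Irina Acha']}),
]

# One concat call combines every page; unlike the loop of
# df.append(df2, ignore_index=True) above, nothing is discarded.
df = pd.concat(pages, ignore_index=True)
```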
Upvotes: 0
Views: 93
Reputation: 1724
You can see whether any requests are being sent to or from the server in dev tools -> Network -> Fetch/XHR tab. Click on the next page and you'll see this link in the Headers tab:
https://www.edison.k12.nj.us/fs/elements/59?const_page=1&is_draft=false&is_load_more=true&parent_id=59&_=1629643598511
You can try a very basic for ... in range() loop and replace const_page={VALUE} and parent_id=59&_=162964359851{VALUE} with the loop value.
Note: this is slow; swap it for a faster solution if needed.
import requests
from bs4 import BeautifulSoup

for index in range(1, 240):
    params = {
        'const_page': index,
        'is_draft': 'false',
        'is_load_more': 'true',
        'parent_id': '59',
        '_': f'162964359851{index}'  # only the LAST number changes on each page, same as the const_page number
    }
    html = requests.get('https://www.edison.k12.nj.us/fs/elements/59', params=params)
    soup = BeautifulSoup(html.text, 'lxml')
    title = soup.select_one('.fsConstituentProfileLink').text
    print(title)
--------
'''
Donna Abatemarco
Irina Acha
Philip Adornato
Victoria Ajijedidun
Taylor Aljian
Kelly Amabile
Elizabeth Andrade
Deliane Antonio
Nicole Aravena
Pamela Aurilio
Aimee Baer
Sharmila Balaji
Meghan Banach
... more names
'''
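On the "faster solution" note above: one easy win is reusing a single connection with requests.Session instead of opening a new one per page. A sketch under that assumption (page_url is a hypothetical helper; it only builds the paginated URL so you can see the shape without hitting the site):

```python
import requests

# Reuse one TCP connection across all 239 requests.
session = requests.Session()

def page_url(index: int) -> str:
    # Build the paginated URL (same parameters as the loop above)
    # without sending the request.
    params = {
        'const_page': str(index),
        'is_draft': 'false',
        'is_load_more': 'true',
        'parent_id': '59',
    }
    req = requests.Request('GET', 'https://www.edison.k12.nj.us/fs/elements/59',
                           params=params)
    return session.prepare_request(req).url

# In the real loop you would call session.get(...) with these params,
# which reuses the pooled connection instead of reconnecting each time.
```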
Upvotes: 1