Reputation: 353
With the following URL:
https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=Celticss&PlayerMovementChkBx=yes&submit=Search&start=0
I am attempting to scrape the results of the table presented there. The issue is that, no matter what I try, the search results are limited to 25 per page, and as you can see, there are thousands of results spread over multiple pages.
I've attempted to change the begin and end dates, to no avail.
When I scrape using Beautiful Soup, I can only scrape page 1 of the results, then the scrape stops. What am I missing to scrape all 85 pages of results in this case? (My code runs successfully, but only returns the scrape from page 1.)
Here is my code:
import requests
from bs4 import BeautifulSoup

blah = []
html = 'https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=Celticss&PlayerMovementChkBx=yes&submit=Search&start=0'
webpage = requests.get(html)
content = webpage.content
soup = BeautifulSoup(content, 'html.parser')
# Walk every table row and collect the text of each cell
for item in soup.find_all('tr'):
    for value in item.find_all('td'):
        gm = value.text
        blah.append(gm)
Upvotes: 1
Views: 1422
Reputation: 1131
Add a loop around your whole snippet that scrapes one page of the table, and increment the start parameter in the URL by 25 each pass (the site serves 25 results per page, and start is the row offset). In the snippet below I just made a counter variable that starts at zero and gets incremented by 25 each loop. The code breaks out of the loop when the response to the request is no longer valid, meaning you hit an error or ran past the end of your search results. You could modify that check to break only on a 404, print the error, etc.
The code below is not tested; it's just a demonstration of the concept.
import requests
from bs4 import BeautifulSoup

blah = []
url = 'https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=Celticss&PlayerMovementChkBx=yes&submit=Search&start='
counter = 0
while True:
    # Build the page URL fresh each pass; appending to url itself
    # (url += str(counter)) would corrupt it on the second iteration
    webpage = requests.get(url + str(counter))
    if webpage.status_code != 200:
        break
    soup = BeautifulSoup(webpage.content, 'html.parser')
    for item in soup.find_all('tr'):
        for value in item.find_all('td'):
            gm = value.text
            blah.append(gm)
    counter += 25
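One caveat: if the server answers 200 even past the last page (search forms often just render an empty table rather than erroring), the status-code check alone will never break and the loop runs forever. A minimal variation on the snippet above, equally untested, stops once a page yields no table cells:

import requests
from bs4 import BeautifulSoup

blah = []
url = 'https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=Celticss&PlayerMovementChkBx=yes&submit=Search&start='
counter = 0
while True:
    webpage = requests.get(url + str(counter))
    if webpage.status_code != 200:
        break
    soup = BeautifulSoup(webpage.content, 'html.parser')
    # Collect this page's cells separately so an empty page is detectable
    page_cells = [value.text for item in soup.find_all('tr')
                  for value in item.find_all('td')]
    if not page_cells:
        # 200 response but no table cells: we walked past the last page
        break
    blah.extend(page_cells)
    counter += 25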
Upvotes: 1