Reputation: 4634
I was practicing the crawler by using Python
.
My target is to find the test date on GRE website.
Here is what I've done now.
import urllib2
from bs4 import BeautifulSoup
from urllib2 import urlopen, Request
gre_url = 'https://ereg.ets.org/ereg/public/testcenter/availability/seats?testId=30&testName=GRE+General+Test&location=Taipei+City%2C+Taiwan&latitude=25.0329636&longitude=121.56542680000007&testStartDate=April-01-2017&testEndDate=May-31-2017¤tTestCenterCount=0&sourceTestCenterCount=0&adminCode=&rescheduleFlow=false&isWorkflow=true&oldTestId=30&oldTestTime=&oldTestCenterId=&isUserLoggedIn=true&oldTestTitle=&oldTestCenter=&oldTestType=&oldTestDate=&oldTestTimeInfo=&peviewTestSummaryURL=%2Fresch%2Ftestpreview%2Fpreviewtestsummary&rescheduleURL='
data = urllib2.urlopen(gre_url).read()
soup = BeautifulSoup(data, "html.parser")
print soup.select('div.panel-heading.accordion-heading') # return []
However, it seems that it can't extract the element div.panel-heading.accordion-heading
from data
.
How do I fix it?
Upvotes: 1
Views: 184
Reputation: 474241
You need to do it in multiple steps visiting subsequent URLs before making the final get requests to check the availability. Here is something that works for me using requests.Session()
:
import json
import requests
from bs4 import BeautifulSoup
start_url = "https://www.ets.org/gre/revised_general/register/centers_dates/"
workflow_url = "https://ereg.ets.org/ereg/public/workflowmanager/schlWorkflow?_p=GRI"
seats_url = "https://ereg.ets.org/ereg/public/testcenter/availability/seats"
with requests.Session() as session:
session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
session.get(start_url)
session.get(workflow_url)
response = session.get("https://ereg.ets.org/ereg/public/testcenter/availability/seats?testId=30&testName=GRE+General+Test&location=New+York%2C+NY%2C+United+States&latitude=40.7127837&longitude=-74.00594130000002&testStartDate=March-27-2017&testEndDate=April-30-2017¤tTestCenterCount=0&sourceTestCenterCount=0&adminCode=&rescheduleFlow=false&isWorkflow=true&oldTestId=30&oldTestTime=&oldTestCenterId=&isUserLoggedIn=true&oldTestTitle=&oldTestCenter=&oldTestType=&oldTestDate=&oldTestTimeInfo=&peviewTestSummaryURL=%2Fresch%2Ftestpreview%2Fpreviewtestsummary&rescheduleURL=")#
soup = BeautifulSoup(response.content, "html.parser")
result = json.loads(soup.select_one('#findSeatResponse')['value'])
for date in result['sortedDates']:
print(date['displayDate'])
Of course, change the last URL to the desired one.
Upvotes: 4