Python crawler can't find element

Question

I am practicing writing a web crawler in Python.

My goal is to find the available test dates on the GRE website.

Here is what I have so far:

import urllib2
from bs4 import BeautifulSoup

gre_url = 'https://ereg.ets.org/ereg/public/testcenter/availability/seats?testId=30&testName=GRE+General+Test&location=Taipei+City%2C+Taiwan&latitude=25.0329636&longitude=121.56542680000007&testStartDate=April-01-2017&testEndDate=May-31-2017&currentTestCenterCount=0&sourceTestCenterCount=0&adminCode=&rescheduleFlow=false&isWorkflow=true&oldTestId=30&oldTestTime=&oldTestCenterId=&isUserLoggedIn=true&oldTestTitle=&oldTestCenter=&oldTestType=&oldTestDate=&oldTestTimeInfo=&peviewTestSummaryURL=%2Fresch%2Ftestpreview%2Fpreviewtestsummary&rescheduleURL='
data = urllib2.urlopen(gre_url).read()
soup = BeautifulSoup(data, "html.parser")
print soup.select('div.panel-heading.accordion-heading')  # prints []

However, the selector div.panel-heading.accordion-heading matches nothing in data. How do I fix this?

alecxe · Accepted Answer

You need to do this in multiple steps, visiting the preceding URLs before making the final GET request that checks availability, presumably because the site relies on session state (cookies) set by those earlier pages. Here is something that works for me using requests.Session():

import json

import requests
from bs4 import BeautifulSoup


start_url = "https://www.ets.org/gre/revised_general/register/centers_dates/"
workflow_url = "https://ereg.ets.org/ereg/public/workflowmanager/schlWorkflow?_p=GRI"
seats_url = "https://ereg.ets.org/ereg/public/testcenter/availability/seats"
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}

    session.get(start_url)
    session.get(workflow_url)
    response = session.get("https://ereg.ets.org/ereg/public/testcenter/availability/seats?testId=30&testName=GRE+General+Test&location=New+York%2C+NY%2C+United+States&latitude=40.7127837&longitude=-74.00594130000002&testStartDate=March-27-2017&testEndDate=April-30-2017&currentTestCenterCount=0&sourceTestCenterCount=0&adminCode=&rescheduleFlow=false&isWorkflow=true&oldTestId=30&oldTestTime=&oldTestCenterId=&isUserLoggedIn=true&oldTestTitle=&oldTestCenter=&oldTestType=&oldTestDate=&oldTestTimeInfo=&peviewTestSummaryURL=%2Fresch%2Ftestpreview%2Fpreviewtestsummary&rescheduleURL=")

    soup = BeautifulSoup(response.content, "html.parser")
    result = json.loads(soup.select_one('#findSeatResponse')['value'])
    for date in result['sortedDates']:
        print(date['displayDate'])

Of course, change the last URL to match the location and date range you want.
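
Editing that long URL by hand is error-prone, so one option is to build the query string from a list of parameters instead. The following is only a sketch, separate from the tested code above: the helper name build_seats_url is hypothetical, the parameter names and fixed values are simply copied from the URL in this answer, and I have not checked which of them the site actually requires. It uses Python 3's urllib.parse.urlencode, which encodes spaces as + and commas as %2C, the same way the original URL does.

```python
from urllib.parse import urlencode

SEATS_URL = "https://ereg.ets.org/ereg/public/testcenter/availability/seats"


def build_seats_url(location, latitude, longitude, start_date, end_date):
    """Return the seat-availability URL for one location and date range.

    Hypothetical helper: parameter names and fixed values are copied from
    the URL used in the answer, not taken from any documented API.
    Dates use the site's own format, e.g. "March-27-2017".
    """
    params = [
        ("testId", "30"),
        ("testName", "GRE General Test"),
        ("location", location),
        ("latitude", latitude),    # pass as strings to keep the digits as-is
        ("longitude", longitude),
        ("testStartDate", start_date),
        ("testEndDate", end_date),
        ("currentTestCenterCount", "0"),
        ("sourceTestCenterCount", "0"),
        ("adminCode", ""),
        ("rescheduleFlow", "false"),
        ("isWorkflow", "true"),
        ("oldTestId", "30"),
        ("oldTestTime", ""),
        ("oldTestCenterId", ""),
        ("isUserLoggedIn", "true"),
        ("oldTestTitle", ""),
        ("oldTestCenter", ""),
        ("oldTestType", ""),
        ("oldTestDate", ""),
        ("oldTestTimeInfo", ""),
        # "peview" is the spelling used in the site's own URL
        ("peviewTestSummaryURL", "/resch/testpreview/previewtestsummary"),
        ("rescheduleURL", ""),
    ]
    # urlencode percent-encodes values and joins the pairs with "&"
    return SEATS_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_seats_url(
        "Taipei City, Taiwan",
        "25.0329636",
        "121.56542680000007",
        "April-01-2017",
        "May-31-2017",
    ))
```

The final request in the session would then be session.get(build_seats_url(...)) with your own location, coordinates, and dates.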
