Can't GET past pagination with Requests

Synopsis: Given a web response to a query string submitted with ‘selenium’, I cannot get ‘requests’ to GET the hrefs, nor can I get past the pagination (which only displays the first 20 articles) to scrape thousands of articles.

I am using my local library website to connect to a paid online subscription database site run by Infotrac called “Florida Newspaper Database”. Initially, I use Python and selenium to run a web driver instance that logs into the local library site and grabs its params, then opens the main Infotrac site to capture its params, then opens the Florida Newspaper Database site and submits a search string. I went to selenium because I was unable to get ‘requests’ to do it.

All of this works, inelegantly to say the least. However, once I get the response from the Florida Newspaper Database, I am faced with two hurdles I have been unable to overcome. The response to my query, in this case “byline john romano”, generates over three thousand articles, all of which I want to programmatically download. I’m trying to get ‘requests’ to handle the download, but so far without any success.

The initial response page to the search string only displays links (href) for the first 20 articles. Using BeautifulSoup I can capture the URLs in a list. However, I haven’t had any success in using requests to GET the href pages. Even if I could, I still face the pagination issue with 20 displayed articles out of thousands.

While I love the idea of ‘requests’, it has been a bear to learn and work with. Reading the docs goes only so far. I bought ‘Essential Requests’ from Packt Publishing and found it horrendous. Does anyone have a reading list for requests?

import requests
from requests import Session
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


# opening the library page and finding the input elements

browser = webdriver.Firefox()
browser.get("https://pals.polarislibrary.com/polaris/logon.aspx")
username = browser.find_element_by_id("textboxBarcodeUsername")
password = browser.find_element_by_id("textboxPassword")
button = browser.find_element_by_id("buttonSubmit")

# inputting username and password

username.send_keys("25913000925235")
password.send_keys("9963")
button.send_keys(Keys.ENTER)

# opening the Infotrac page with the right cookies in the browser url

browser.get("http://infotrac.galegroup.com/itweb/palm83799?db=SP19")

# finding the input elements, first username

idFLNDB = browser.find_element_by_name("id")
idFLNDB.send_keys("25913000925235")

# finding the "Proceed" button by xpath because there's no name or id, and clicking it

submit = browser.find_element_by_xpath("//input[@type='submit']")
submit.send_keys(Keys.ENTER)

# now get the Florida Newspaper Database page, find input element

searchBox = browser.find_element_by_id("inputFieldValue_0")
homepage = browser.find_element_by_id("homepage_submit")

# input your search string

searchTopic = input("Type in your search string: ")
searchBox.send_keys(searchTopic)
homepage.send_keys(Keys.ENTER)

# get the cookies from selenium's webbrowser instance

cookies = browser.get_cookies()

# open up a requests session

s = requests.Session()

# get the cookies from selenium to requests

for cookie in cookies:
    s.cookies.set(cookie['name'], cookie['value'])


searchTopic1 = searchTopic.replace(' ', '+')

# This is the param from the main search page

payload = {
    "inputFieldValue(0)": searchTopic1,
    "inputFieldName(0)": "OQE",
    "nwf": "y",
    "searchType": "BasicSearchForm",
    "userGroupName": "palm83799",
    "prodId": "SPJ.SP19",
    "method": "doSearch",
    "dblist": "",
    "standAloneLimiters": "LI",
}

current_url = browser.current_url

response = s.get(current_url, data=payload)
print("This is the status code:", response.status_code)
print("This is the current url:", current_url)

# This gives you BeautifulSoup object

soup = BeautifulSoup(response.content, "lxml")

# This gives you all of the article tags

links = soup.find_all(class_="documentLink")

# This next portion gives you the href values from the article tags as a list titled linksUrl

linksUrl = [link['href'] for link in links]

# These are the param's from the article links off of the basic search page
payload2 = {
    "sort": "DA-SORT",
    "docType": "Column",
    "tabID": "T004",
    "prodId": "SPJ.SP19",
    "searchId": "R1",
    "resultType": "RESULT_LIST",
    "searchType": "BasicSearchForm"
}


# These are the request headers from a single article that I opened
articlePayload = {
    "Host": "code.jquery.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip,deflate",
    "Referer": "http://askalibrarian.org/widgets/gale/statewide",
    "Connection": "keep-alive"
}

Upvotes: 3

Answers (1)

Andrés Pérez-Albela H.

Reputation: 4021

I've created a PoC to help you understand how you could do it with the requests library.

This script only scrapes:

title and link of every news/article within every page of the search results for the provided keyword(s)

You can adapt the code to scrape the specific data you are interested in.

The code has comments, so I won't explain too much outside the code. Still, if you have any further questions, just let me know.

from lxml import html
from requests import Session

## Setting some vars
LOGIN_URL = "http://infotrac.galegroup.com/default/palm83799?db=SP19"
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"

## Payload for LOGIN_URL page
payload = {
    'db':'SP19',
    'locpword':'25913000925235',
    'proceed':'Authenticate',
}

## Headers to be set for every request with our requests.Session()
headers = {
    'User-Agent':USER_AGENT
}

## requests.Session instance
s = Session()

## Updating/setting headers to be used in every request within our Session()
s.headers.update(headers)

## Making first request to our LOGIN_URL page to get Cookies and Sessions we will need later
s.get(LOGIN_URL)

def extractTitlesAndLinksFromPaginatePageResponse(response, page):
    ## Creating a dictionary with the following structure
    ## {
    ##     page: { ## this value is the page number
    ##         "news": None # right now we leave it as None until we have all the news (dict), from this page, scraped
    ##     }
    ## }
    ##
    ## e.g.
    ##
    ## {
    ##     1: {
    ##        "news": None # right now we leave it as None until we have all the news (dict), from this page, scraped
    ##     }
    ## }
    ##
    news = {page: dict(news=None)}

    ## count = The result's number. e.g. The first result from this page will be 1, the second result will be 2, and so on until 20.
    count = 1

    ## Parsing the HTML from response.content
    tree = html.fromstring(response.content)

    ## Creating a dictionary with the following structure
    ## {
    ##     count: { ## count will be the result number for the current page
    ##            "title": "Here goes the news title",
    ##            "link": "Here goes the news link",
    ##     }
    ## }
    ##
    ## e.g.
    ##
    ## {
    ##     1: {
    ##        "title": "Drought swept aside; End-of-angst story? This is much more.",
    ##        "link": "http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=1921&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA138024966&contentSet=GALE%7CA138024966",
    ##     },
    ##     2: {
    ##        "title": "The Fast Life.",
    ##        "link": "http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=1922&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA137929858&contentSet=GALE%7CA137929858",
    ##     },
    ##     ...and so on...
    ## }
    tmp_dict = dict()

    ## Applying some xPATHs to extract every result from the current page
    ## Adding "http://go.galegroup.com/ps/" prefix to every result's link we extract
    ## Adding results to tmp_dict
    ## Count increment +1
    for result in tree.xpath('//li[@class="citation-view"]'):
        link, title = result.xpath('.//div[@class="titleWrapper"]/span[@class="title"]/a/@href | .//div[@class="titleWrapper"]/span[@class="title"]/a/text()')
        link = "{}{}".format("http://go.galegroup.com/ps/", link)
        tmp_dict[count] = dict(title=title, link=link)
        count += 1

    ## Assigning tmp_dict as value of news[page]["news"]
    news[page]["news"] = tmp_dict

    ## Returning news dictionary with all of the results from the current page
    return news


def searchKeyWord(search_string):
    ## Creating a dictionary with the following structure
    ## {
    ##     "keyword": search_string,  ## in this case 'search_string' is "byline john romano"
    ##     "pages": None              ## right now we leave it as None until we have all the pages scraped
    ## }
    full_news = dict(keyword=search_string, pages=None)

    ## This will be a temporary dictionary which will contain all the pages and news inside. This is the dict that will be the value of full_news["pages"]
    tmp_dict = dict()

    ## Replacing spaces with 'plus' sign to match the website's behavior
    search_string = search_string.replace(' ', '+')
    ## URL of the first page for every search request
    search_url = "http://go.galegroup.com/ps/basicSearch.do?inputFieldValue(0)={}&inputFieldName(0)=OQE&inputFieldName(0)=OQE&nwf=y&searchType=BasicSearchForm&userGroupName=palm83799&prodId=SPJ.SP19&method=doSearch&dblist=&standAloneLimiters=LI".format(search_string)

    ##
    ## count = Number of the page we are currently scraping
    ## response_code = The response code we should match against every request we make to the pagination endpoint. Once it returns a 500 response code, it means we have reached the last page
    ## currentPosition = It's like an offset var, which contains the position of the first result of the next page to be rendered. We will increment its value by 20 for each page we request.
    ##
    count = 1 ## Don't change this value. It should always be 1.
    response_code = 200 ## Don't change this value. It should always be 200.
    currentPosition = 21 ## Don't change this value. It should always be 21.

    ## Making a GET request to the search_url (first results page)
    first_page_response = s.get(search_url)
    ## Calling extractTitlesAndLinksFromPaginatePageResponse() with the response and count (number of the page we are currently scraping)
    first_page_news = extractTitlesAndLinksFromPaginatePageResponse(first_page_response, count)
    ## Updating our tmp_dict with the dict of news returned by extractTitlesAndLinksFromPaginatePageResponse()
    tmp_dict.update(first_page_news)

    ## If response code of last pagination request is not 200 we stop looping
    while response_code == 200:
        count += 1
        paginate_url = "http://go.galegroup.com/ps/paginate.do?currentPosition={}&inPS=true&prodId=SPJ.SP19&searchId=R1&searchResultsType=SingleTab&searchType=BasicSearchForm&sort=DA-SORT&tabID=T004&userGroupName=palm83799".format(currentPosition)
        ## Making a request to the next paginate page with special headers to make sure our script follows the website's behavior
        next_pages_response = s.get(paginate_url, headers={'X-Requested-With':'XMLHttpRequest', 'Referer':search_url})
        ## Updating response code to be checked before making the next paginate request
        response_code = next_pages_response.status_code
        ## Calling extractTitlesAndLinksFromPaginatePageResponse() with the response and count (number of the page we are currently scraping)
        pagination_news = extractTitlesAndLinksFromPaginatePageResponse(next_pages_response, count)
        ## Updating dict with pagination's current page results
        tmp_dict.update(pagination_news)
        ## Updating our offset/position
        currentPosition += 20

    ## Deleting results from 500 response code
    del tmp_dict[count]

    ## When the while loop has finished making requests and extracting results from every page
    ## Pages dictionary, with all the pages and their corresponding results/news, becomes a value of full_news["pages"]
    full_news["pages"] = tmp_dict
    return full_news

## This is the POST request to LOGIN_URL with our payload data and some extra headers to make sure everything works as expected
login_response = s.post(LOGIN_URL, data=payload, headers={'Referer':'http://infotrac.galegroup.com/default/palm83799?db=SP19', 'Content-Type':'application/x-www-form-urlencoded'})

## Once we are logged in and our Session has all the website's cookies and sessions
## We call searchKeyWord() function with the text/keywords we want to search for
## Results will be stored in all_the_news var
all_the_news = searchKeyWord("byline john romano")

## Finally, you can print the results
print(all_the_news)
## Or do whatever you need to do, for example loop over the all_the_news dictionary to make requests to every news url and scrape the data you are interested in.
## You can also adjust the script (add one more function) to scrape every news detail page data, and call it from inside of extractTitlesAndLinksFromPaginatePageResponse()

It will output something like this (just a sample, to avoid pasting too much data):

{
    'pages': {
        1: {
            'news': {
                1: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=1&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA433496708&contentSet=GALE%7CA433496708',
                    'title': 'ANGER AT DECISIONS BUT APATHY AT POLLS.'
                },
                2: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=2&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA433399216&contentSet=GALE%7CA433399216',
                    'title': 'SMART GUN TECHNOLOGY STARTING TO MAKE SENSE.'
                },
                3: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=3&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA433029222&contentSet=GALE%7CA433029222',
                    'title': 'OF COURSE, FIRE S.C. DEPUTY, BUT MAYBE ...'
                },
                4: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=4&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA432820751&contentSet=GALE%7CA432820751',
                    'title': 'SCHOOL REFORMS MISS REAL PROBLEM.'
                },
                5: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=5&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA432699330&contentSet=GALE%7CA432699330',
                    'title': 'TENSION IS UNNECESSARILY THICK AT CITY HALL.'
                },
                6: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=6&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA432285591&contentSet=GALE%7CA432285591',
                    'title': 'OPT OUT MOVEMENT ON TESTING GETS NOTICE.'
                },
                7: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=7&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA432088310&contentSet=GALE%7CA432088310',
                    'title': 'CREDIT CITY COUNCIL FOR OPTIONS ON RAYS DEAL.'
                },
                8: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=8&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA431979679&contentSet=GALE%7CA431979679',
                    'title': 'FLORIDA CAN PLAY IT SMART ON MARIJUANA.'
                },
                9: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Article&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=9&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA432008411&contentSet=GALE%7CA432008411',
                    'title': 'A PLAY-BY-PLAY LOOK AT LIFE, THE RAYS.'
                },
                10: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=10&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA431632768&contentSet=GALE%7CA431632768',
                    'title': 'QUALITY LACKING AS FLORIDA ADDS JOBS.'
                },
                11: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=11&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA431451912&contentSet=GALE%7CA431451912',
                    'title': 'INSTEAD OF EMPATHY, JUDGE ADDS TO ABUSE.'
                },
                12: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=12&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA431359125&contentSet=GALE%7CA431359125',
                    'title': 'HE WANTS TO CONTROL HIS DEATH, HIS WAY.'
                },
                13: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=13&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430976221&contentSet=GALE%7CA430976221',
                    'title': "POLITICAL PARTY'S RISE RAVAGED BY 'CRACKPOT'."
                },
                14: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=14&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430813416&contentSet=GALE%7CA430813416',
                    'title': "STADIUM TALKS VS. HISTORY'S CURVEBALLS."
                },
                15: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=15&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430729230&contentSet=GALE%7CA430729230',
                    'title': 'OVERHAUL BUSH-ERA EDUCATION REFORMS.'
                },
                16: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=16&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430430295&contentSet=GALE%7CA430430295',
                    'title': 'BEWARE OF EXTRA FEES FOR CAR TAG RENEWALS.'
                },
                17: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=17&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430179746&contentSet=GALE%7CA430179746',
                    'title': 'STATE FAILS SICK KIDS, THEN FIGHTS CHANGES.'
                },
                18: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=18&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA430104409&contentSet=GALE%7CA430104409',
                    'title': 'HOW A BIG CHANGED THE LIFE OF A LITTLE.'
                },
                19: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=19&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA429647686&contentSet=GALE%7CA429647686',
                    'title': 'PARK PLAN PUTS HEAT ON RAYS DECISION.'
                },
                20: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=20&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA429444602&contentSet=GALE%7CA429444602',
                    'title': 'SCOTT WILL TAKE CREDIT, BUT DODGES THE BURDEN.'
                }
            }
        },
        2: {
            'news': {
                1: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=21&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA428920357&contentSet=GALE%7CA428920357',
                    'title': 'HARD LINE ON POOR WORSE THAN OFFENSES.'
                },
                2: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=22&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA428643272&contentSet=GALE%7CA428643272',
                    'title': "DON'T RUN THE GRAND PRIX OUT OF TOWN."
                },
                3: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=23&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA428565070&contentSet=GALE%7CA428565070',
                    'title': "PUT JEB'S EDUCATION REFORMS TO THE TEST."
                },
                4: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=24&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA428196500&contentSet=GALE%7CA428196500',
                    'title': 'SINCERE APOLOGY IS A THING OF THE PAST.'
                },
                5: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=25&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA427980323&contentSet=GALE%7CA427980323',
                    'title': 'MISTRUST OF LEADERS DAMAGES EDUCATION.'
                },
                6: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=26&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA428127291&contentSet=GALE%7CA428127291',
                    'title': "ONLY ONE REMEDY FOR CLERK'S CONFLICT."
                },
                7: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=27&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA427578446&contentSet=GALE%7CA427578446',
                    'title': 'LOCAL POT LAWS COULD EASE RIGID PENALTIES.'
                },
                8: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=28&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA427324906&contentSet=GALE%7CA427324906',
                    'title': "UTILITIES' PLAN KEEPS CONSUMERS IN THE DARK."
                },
                9: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=29&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA427220594&contentSet=GALE%7CA427220594',
                    'title': 'JUDGE CONQUERS RETIREMENT WITH VERVE.'
                },
                10: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=30&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA426790479&contentSet=GALE%7CA426790479',
                    'title': 'APOLOGIES WOULD HELP IN SCHOOLS DISCUSSION.'
                },
                11: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=31&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA426560152&contentSet=GALE%7CA426560152',
                    'title': "PARENTS DON'T BACK BUSH'S TEST EMPHASIS."
                },
                12: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=32&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA426493640&contentSet=GALE%7CA426493640',
                    'title': 'POLITICALLY SPEAKING, THIS YEAR IS PATHETIC.'
                },
                13: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=33&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA426051781&contentSet=GALE%7CA426051781',
                    'title': "BLAMING PARENTS WON'T HELP CHILDREN."
                },
                14: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=34&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA425831366&contentSet=GALE%7CA425831366',
                    'title': "ON FAILING SCHOOLS, IT'S TIME FOR ACTION."
                },
                15: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=35&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA425724018&contentSet=GALE%7CA425724018',
                    'title': "SORRY? OUR LEGISLATORS DON'T KNOW THE WORD."
                },
                16: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=36&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA425256127&contentSet=GALE%7CA425256127',
                    'title': 'IN CLOSET, ESSENTIALS FOR MAKING LIVES BETTER.'
                },
                17: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=37&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA425006012&contentSet=GALE%7CA425006012',
                    'title': 'MEET IN MIDDLE ON TAXI, UBER REGULATION.'
                },
                18: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=38&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA424917550&contentSet=GALE%7CA424917550',
                    'title': "A STUNNING LOSS; The Tarpon Springs man who umpired the baseball game where a bat boy was killed is struggling to cope with the 9-year-old's death."
                },
                19: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=39&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA422480556&contentSet=GALE%7CA422480556',
                    'title': 'RAYS HAVE LOTS OF FANS, JUST NOT AT THE TROP.'
                },
                20: {
                    'link': 'http://go.galegroup.com/ps/retrieve.do?sort=DA-SORT&docType=Column&tabID=T004&prodId=SPJ.SP19&searchId=R1&resultListType=RESULT_LIST&searchType=BasicSearchForm&contentSegment=&currentPosition=40&searchResultsType=SingleTab&inPS=true&userGroupName=palm83799&docId=GALE%7CA422342622&contentSet=GALE%7CA422342622',
                    'title': 'TRY AGAIN WHEN IT COMES TO RECYCLING.'
                }
            }
        },
    },
    'keyword': 'byline john romano'
}

Finally, and as suggested in the comments, you could:

Loop over the all_the_news dictionary to make requests to every news url and scrape the data you are interested in.
Adjust the script (add one more function) to scrape every news detail page data, and call it from inside of extractTitlesAndLinksFromPaginatePageResponse().
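
As a rough sketch of the first suggestion (not part of the script above): the nested dictionary returned by searchKeyWord() can be flattened and each link requested with the same logged-in session. The helper names iter_articles and fetch_articles are made up for illustration, and the sketch assumes that the article pages answer a plain GET once the session carries the login cookies.

```python
def iter_articles(full_news):
    """Yield (page, position, title, link) for every scraped result,
    in page order and then in result order within each page."""
    for page, page_data in sorted(full_news["pages"].items()):
        for position, item in sorted(page_data["news"].items()):
            yield page, position, item["title"], item["link"]


def fetch_articles(session, full_news):
    """Request every article link with the given session.

    `session` is expected to behave like requests.Session (a .get(url)
    method returning an object with .status_code and .content).
    Returns a dict mapping link -> raw response body; links that do not
    answer with status 200 are skipped (an illustrative choice).
    """
    bodies = {}
    for _page, _position, _title, link in iter_articles(full_news):
        response = session.get(link)
        if response.status_code == 200:
            bodies[link] = response.content
    return bodies
```

With the script above this could be called as fetch_articles(s, all_the_news), and each stored body could then be parsed with lxml the same way the result pages are.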

I hope this helps you better understand how the requests library works.
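
One possible refinement: the pagination loop in searchKeyWord() parses the failing response and deletes its entry afterwards. A variant that checks the status code before parsing could look roughly like the sketch below. It keeps the script's own assumption that a non-200 reply means we are past the last page; iter_result_pages, fetch_page and max_pages are hypothetical names, not part of requests or of the website.

```python
def iter_result_pages(fetch_page, first_offset=21, page_size=20, max_pages=1000):
    """Yield (page_number, response) for every pagination request that succeeds.

    fetch_page is a caller-supplied function taking an offset
    (the currentPosition value) and returning an object with a
    .status_code attribute, e.g. a wrapper around s.get(paginate_url).
    Page numbering starts at 2 because page 1 is the initial search page.
    max_pages is a safety cap so the loop always terminates.
    """
    offset = first_offset
    for page_number in range(2, max_pages + 2):
        response = fetch_page(offset)
        if response.status_code != 200:
            break  # stop before parsing the failing response
        yield page_number, response
        offset += page_size
```

Each yielded response could be passed straight to extractTitlesAndLinksFromPaginatePageResponse(response, page_number), so nothing needs to be deleted afterwards.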

Upvotes: 2
