I need to scrape the entire HTML from journal_url, which for the purpose of this example will be http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues . I have followed the requests examples shown in a few questions on this site, but neither r.text nor r.json() on the response from requests.get gives me the HTML I expect. My goal is to get the whole HTML, including the ordered list underneath each year and volume pull-down.
import requests
import pandas as pd
import http.cookiejar

for i in range(0, len(df)):
    journal_name = df.loc[i, "Journal Full Title"]
    journal_url = df.loc[i, "URL"] + "/issues"
    access_start = df.loc[i, "Content Start Date"]
    access_end = df.loc[i, "Content End Date"]
    #cj = http.cookiejar.CookieJar()
    #opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    headers = {"X-Requested-With": "XMLHttpRequest",
               "User-Agent": "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"}
    r = requests.get(journal_url, headers=headers)
    response = r.text
    print(response)
If your ultimate goal is to parse the content you mentioned above from that page, here is one way to do it:
import requests
from bs4 import BeautifulSoup

base_link = "http://onlinelibrary.wiley.com"
main_link = "http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1467-6281/issues"

def abacus_scraper(main_link):
    # Select the year/volume links (a.issuesInYear) on the issues page.
    soup = BeautifulSoup(requests.get(main_link).text, "html.parser")
    for titles in soup.select("a.issuesInYear"):
        title = titles.select("span")[0].text
        title_link = titles.get("href")
        main_content(title, title_link)

def main_content(item, link):
    # Request the page behind that link and collect the text of every issue link on it.
    broth = BeautifulSoup(requests.get(base_link + link).text, "html.parser")
    elems = [issue.text for issue in broth.select("div.issue a")]
    print(item, elems)

abacus_scraper(main_link)
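The `base_link + link` concatenation above assumes that every `href` on the page is site-relative. As a possible variation, `urllib.parse.urljoin` from the standard library resolves relative links and leaves absolute ones untouched. This is only a sketch: `absolute_link` is a hypothetical helper, and the sample `href` values are made up for illustration, not taken from the Wiley page.

```python
from urllib.parse import urljoin

# Same base URL as in the code above.
base_link = "http://onlinelibrary.wiley.com"


def absolute_link(href, base=base_link):
    """Resolve href against the site root; absolute URLs pass through unchanged."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only.
relative = absolute_link("/journal/example/issues?activeYear=2017")
absolute = absolute_link("http://example.org/other/page")
print(relative)
print(absolute)
```

In `main_content`, `requests.get(absolute_link(link))` could then stand in for `requests.get(base_link + link)`.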