Reputation: 47
I'm trying to scrape this page using BeautifulSoup in Python: https://journals.sagepub.com/toc/CPS/current
My main objective is to scrape the titles of all the papers that appear there. After inspecting the structure of the page, I ended up with this code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()
page_soup = BeautifulSoup(webpage, "html.parser")
nameList = page_soup.findAll("h3", {"class": "heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
nameList
However, for some reason my new list always comes back empty. I have used this approach on other pages with good results, so I'm not sure what is missing here.
Any ideas?
Upvotes: 1
Views: 79
Reputation: 159
If I understand what you are trying to scrape from that link, you want the title of every article. The code you have is very close. However, there is a span within each of those h3 tags that holds the data you are looking for.
This code
import requests
import bs4
# version
print("Requests: {}".format(requests.__version__))
print("Beautiful Soup: {}".format(bs4.__version__))
# soup object
link = "https://journals.sagepub.com/toc/CPS/current"
result = requests.get(link)
soup = bs4.BeautifulSoup(result.content, 'lxml')
# parse for the heading titles
article_names = []
foo = soup.find_all('span', class_='hlFld-Title')
for it in foo:
    print(it.text)
    article_names.append(it.text)
is very similar to yours. The only difference is that I parse for the span tags inside each h3 tag, whereas your code parsed for the h3 tag itself.
Code output looks like this:
Requests: 2.25.1
Beautiful Soup: 4.9.3
When Does the Public Get It Right? The Information Environment and the Accuracy of Economic Sentiment
Does Affirmative Action Work? Evaluating India’s Quota System
Legacies of Resistance: Mobilization Against Organized Crime in Mexico
Political Institutions and Coups in Dictatorships
Generous to Workers ≠ Generous to All: Implications of European Unemployment Benefit Systems for the Social Protection of Immigrants
Drinking Alone: Local Socio-Cultural Degradation and Radical Right Support—The Case of British Pub Closures
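If you want to compare the two selectors without hitting the network, here is a minimal sketch on a hand-written snippet. The markup is an assumption modelled on the class names used above (heading-title and hlFld-Title); it is not copied from the real page, which may nest things differently.

```python
from bs4 import BeautifulSoup

# Hand-written markup: an assumption modelled on the class names above,
# not copied from the real page.
SAMPLE_HTML = """
<div>
  <h3 class="heading-title">
    <a href="#"><span class="hlFld-Title">First Paper</span></a>
  </h3>
  <h3 class="heading-title">
    <a href="#"><span class="hlFld-Title">Second Paper</span></a>
  </h3>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selecting the inner span gives the bare title text.
span_titles = [
    span.get_text() for span in soup.find_all("span", class_="hlFld-Title")
]

# Selecting the outer h3 also reaches the title; strip=True drops the
# whitespace that surrounds the nested tags.
h3_titles = [
    h3.get_text(strip=True) for h3 in soup.find_all("h3", class_="heading-title")
]

print(span_titles)
print(h3_titles)
```

On markup shaped like this, both selectors end up with the same titles, so either one should do once the right HTML has actually been downloaded.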
I hope this is what you're shooting for.
Upvotes: 1
Reputation: 26539
curl -v https://journals.sagepub.com/toc/CPS/current
reveals that the server answers with a 302 redirect. urllib's default opener does follow redirects, but it does not keep cookies between hops by default, so a redirect that depends on a cookie may leave you with a response that lacks the content you are looking for.
Andrej Kesely posted an answer that uses the requests library, which handles redirects and cookies automatically.
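If you would rather stay with urllib, one thing to try is an opener that keeps cookies across the redirect hops. This is only a sketch: fetch_with_cookies is a hypothetical helper name, and whether cookies are what trips up this particular site is an assumption I have not verified.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, Request, build_opener


def fetch_with_cookies(url, user_agent="Mozilla/5.0", timeout=10):
    """Fetch url with urllib, keeping cookies across redirect hops.

    Returns (final_url, body_bytes). This is a hypothetical helper,
    not part of urllib itself.
    """
    # The cookie processor stores any Set-Cookie sent with a redirect
    # response and sends it back on the follow-up request.
    opener = build_opener(HTTPCookieProcessor(CookieJar()))
    req = Request(url, headers={"User-Agent": user_agent})
    with opener.open(req, timeout=timeout) as resp:
        # geturl() is the URL actually retrieved after any redirects.
        return resp.geturl(), resp.read()
```

You would call it as final_url, webpage = fetch_with_cookies("https://journals.sagepub.com/toc/CPS/current") and then pass webpage to BeautifulSoup as before. Comparing final_url with the URL you asked for tells you whether a redirect happened.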
Upvotes: 1
Reputation: 195408
It seems that urllib has trouble getting the correct result from the server. Try the requests module instead; it's more capable:
import requests
from bs4 import BeautifulSoup
url = "https://journals.sagepub.com/toc/CPS/current"
req = requests.get(url)
page_soup = BeautifulSoup(req.content, "html.parser")
nameList = page_soup.findAll("h3", {"class": "heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
print(List)
Prints:
[
"When Does the Public Get It Right? The Information Environment and the Accuracy of Economic Sentiment",
"Does Affirmative Action Work? Evaluating India’s Quota System",
"Legacies of Resistance: Mobilization Against Organized Crime in Mexico",
"Political Institutions and Coups in Dictatorships",
"Generous to Workers ≠ Generous to All: Implications of European Unemployment Benefit Systems for the Social Protection of Immigrants",
"Drinking Alone: Local Socio-Cultural Degradation and Radical Right Support—The Case of British Pub Closures",
]
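If you want to confirm what the server did before parsing, a slightly more defensive variant could look like the sketch below. The helper names extract_titles and fetch_titles are made up for illustration, and the timeout and User-Agent values are arbitrary choices, not anything the site requires as far as I know.

```python
import requests
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the stripped text of every h3 with class heading-title."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        h3.get_text(strip=True)
        for h3 in soup.find_all("h3", class_="heading-title")
    ]


def fetch_titles(url, timeout=10):
    """Download url and return its titles, printing any redirect hops."""
    resp = requests.get(
        url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout
    )
    # Fail loudly on 4xx/5xx instead of silently parsing an error page.
    resp.raise_for_status()
    # resp.history holds the intermediate redirect responses, if any.
    for hop in resp.history:
        print(hop.status_code, hop.url)
    return extract_titles(resp.content)
```

Calling fetch_titles("https://journals.sagepub.com/toc/CPS/current") should then return the same list as above, and the printed hops show whether a redirect was followed along the way. Keeping the parsing in its own function also lets you test it on a saved copy of the page.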
Upvotes: 1