ffolkvar

Reputation: 47

Why is my code not able to scrape this webpage?

I'm trying to scrape this page with BeautifulSoup in Python: https://journals.sagepub.com/toc/CPS/current

My main objective is to scrape the titles of all the papers listed there. After inspecting the page structure, I ended up with this code:

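from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
# send a browser-like User-Agent with the request
req = Request(url, headers = { "User-Agent": "Mozilla/5.0"})
webpage = urlopen(req).read()
page_soup = BeautifulSoup(webpage,"html.parser")
# every paper title sits in an <h3 class="heading-title">
nameList = page_soup.findAll("h3", {"class":"heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
List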

However, for some reason my new list always comes out empty. I have used this approach on other pages with good results, so I'm not sure what is missing here.

Any ideas?

Upvotes: 1

Views: 79

Answers (3)

Austin

Reputation: 159

If I understand correctly, you want the title of every article on that page. Your code is very close; however, there is a span within each of those tags that holds the text you are looking for.

This code

import requests
import bs4

# version
print("Requests: {}".format(requests.__version__))
print("Beautiful Soup: {}".format(bs4.__version__))

# soup object
link = "https://journals.sagepub.com/toc/CPS/current"
result = requests.get(link)
soup = bs4.BeautifulSoup(result.content, 'lxml')

# parse for the heading titles
article_names = []
foo = soup.find_all('span', class_='hlFld-Title')
for it in foo:
    print(it.text)
    article_names.append(it.text)

is very similar to yours; the only difference is that I search for the span tags inside the h3 tags, whereas your code searched for the h3 tags themselves.

Code output looks like this:

Requests: 2.25.1
Beautiful Soup: 4.9.3

When Does the Public Get It Right? The Information Environment and the Accuracy of Economic Sentiment

Does Affirmative Action Work? Evaluating India’s Quota System

Legacies of Resistance: Mobilization Against Organized Crime in Mexico

Political Institutions and Coups in Dictatorships

Generous to Workers ≠ Generous to All: Implications of European Unemployment Benefit Systems for the Social Protection of Immigrants

Drinking Alone: Local Socio-Cultural Degradation and Radical Right Support—The Case of British Pub Closures

I hope this is what you're shooting for.
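
To see how the two selectors relate, here is a minimal sketch on a hand-written HTML snippet. The markup is an assumption modelled on the class names mentioned in this thread, not copied from the real page, and the titles are placeholders. On markup shaped like this, both searches return the same text, because get_text() on the outer tag also collects the text of the span inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an h3.heading-title wrapping a span.hlFld-Title,
# loosely modelled on the class names used above (not the real page source).
html = """
<h3 class="heading-title"><a href="#"><span class="hlFld-Title">First Paper</span></a></h3>
<h3 class="heading-title"><a href="#"><span class="hlFld-Title">Second Paper</span></a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Text taken from the outer h3 tags
from_h3 = [h3.get_text(strip=True) for h3 in soup.find_all("h3", class_="heading-title")]

# Text taken from the inner span tags
from_span = [s.get_text(strip=True) for s in soup.find_all("span", class_="hlFld-Title")]

print(from_h3)
print(from_span)
```

If both lists come back empty on the live page, it is worth checking what HTML was actually downloaded before changing the selector.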

Upvotes: 1

Colonel Thirty Two

Reputation: 26539

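curl -v https://journals.sagepub.com/toc/CPS/current reveals that the page returns a 302 redirect. The redirect response itself won't have the content you are looking for, and the plain urlopen call doesn't seem to make it through the redirect to the real page here. (urlopen does follow redirects by default, but it keeps no cookies between requests, which some sites require.)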

Andrej Kesely posted an answer that uses the requests library, which does do auto-redirection.
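
It can help to verify the redirect behaviour directly rather than take it on faith. The sketch below uses only the standard library and a throwaway local server; the server, its /start and /final paths, and the HTML body are all made up for illustration and say nothing about how the real site behaves. It shows one way to tell whether a redirect happened: compare the URL you requested with what geturl() reports on the response.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class Handler(BaseHTTPRequestHandler):
    """Hypothetical test server: /start answers 302, anything else answers 200."""

    def do_GET(self):
        if self.path == "/start":
            self.send_response(302)
            self.send_header("Location", "/final")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            payload = b"<h3 class='heading-title'>Example title</h3>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the example output quiet


# Port 0 lets the OS pick a free port
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
start_url = f"http://127.0.0.1:{server.server_port}/start"

# Default handlers plus an empty ProxyHandler, so a system proxy
# cannot interfere with the request to the local test server
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(start_url) as resp:
    final_url = resp.geturl()  # URL of the response that was actually read
    status = resp.status
    body = resp.read()

server.shutdown()
server.server_close()

print("requested:", start_url)
print("received: ", final_url, status)
```

In this local setup the default opener follows the 302 and ends up at /final. If the same check against a real site shows an unexpected final URL or a body without the expected tags, the fetch step is the place to look, not the parser.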

Upvotes: 1

Andrej Kesely

Reputation: 195408

It seems that urllib has a problem getting the correct result from the server. Try the requests module; it's more capable:

import requests
from bs4 import BeautifulSoup

url = "https://journals.sagepub.com/toc/CPS/current"
req = requests.get(url)
page_soup = BeautifulSoup(req.content, "html.parser")
nameList = page_soup.findAll("h3", {"class": "heading-title"})
List = []
for name in nameList:
    List.append(name.get_text())
print(List)

Prints:

[
    "When Does the Public Get It Right? The Information Environment and the Accuracy of Economic Sentiment",
    "Does Affirmative Action Work? Evaluating India’s Quota System",
    "Legacies of Resistance: Mobilization Against Organized Crime in Mexico",
    "Political Institutions and Coups in Dictatorships",
    "Generous to Workers ≠ Generous to All: Implications of European Unemployment Benefit Systems for the Social Protection of Immigrants",
    "Drinking Alone: Local Socio-Cultural Degradation and Radical Right Support—The Case of British Pub Closures",
]

Upvotes: 1
