Reputation: 23
I'm using
requests.get('https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists')
like so:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")
print(urlsoup.find_all("a", attrs={"class": "large-3 medium-3 cell image"})[0])
But instead of the full URL it keeps scraping just the homepage ('https://www.pastemagazine.com'). I can tell because I expect the print statement to print:
<a class="large-3 medium-3 cell image" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html" aria-label="">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/120/dogcrp-72x72.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg"]'>
<img alt="" />
</picture>
</a>
But instead it prints:
<a aria-label='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"' class="large-3 medium-3 cell image" href="/articles/2019/01/daily-dose-michael-chapman-feat-bridget-st-john-af.html">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-72x72.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg"]'>
<img alt='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"'/>
</picture>
</a>
Which corresponds to an element on the homepage, rather than the specific url I want to scrape from with the search terms. Why does it redirect to the homepage? How can I stop it from doing so?
Upvotes: 0
Views: 5229
Reputation: 2919
If you're sure about the redirection part, you can set allow_redirects to False to stop requests from following the redirect:
r = requests.get(url, allow_redirects=False)
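Before silencing redirects, it's worth confirming one is actually happening. A minimal sketch (check_redirect is a helper name I'm introducing, not part of requests) that disables redirect-following and reports the server's raw answer:

```python
import requests

def check_redirect(url):
    """Fetch url without following redirects and report what the server did.

    Returns (status_code, location): location is the redirect target if the
    server answered with a 3xx status, otherwise None.
    """
    r = requests.get(url, allow_redirects=False)
    if 300 <= r.status_code < 400:
        # The server sent a redirect; the Location header says where to.
        return r.status_code, r.headers.get('Location')
    return r.status_code, None
```

If calling it on the search URL returns a 3xx status with a Location of '/', the server really is bouncing you to the homepage; a 200 would mean the problem lies elsewhere (e.g. in the page content itself).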
Upvotes: 3
Reputation: 22440
To get the URLs of the tweet articles you're after, try the following script. It turns out that sending a User-Agent header along with session cookies resolves the redirect issue.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"
with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    # Deduplicate the hrefs and resolve them against the base URL
    for link in set(urljoin(url, a.get("href")) for a in soup.select("ul.articles a[href*='tweets-of-the-week']")):
        print(link)
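The urljoin call is what turns the root-relative href values from the page into absolute links; for example:

```python
from urllib.parse import urljoin

base = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"
# A root-relative href replaces the path and drops the query string:
print(urljoin(base, "/articles/2018/12/the-funniest-tweets-of-the-week-109.html"))
# → https://www.pastemagazine.com/articles/2018/12/the-funniest-tweets-of-the-week-109.html
```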
Or to make it even easier, upgrade the following libraries:
pip3 install lxml --upgrade
pip3 install beautifulsoup4 --upgrade
And then try:
with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    for item in soup.select("a.noimage[href*='tweets-of-the-week']"):
        print(urljoin(url, item.get("href")))
Upvotes: 0