Reputation: 23
I'm using
requests.get('https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists')
like so:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")
print(urlsoup.find_all("a", attrs={"class": "large-3 medium-3 cell image"})[0])
But instead of the full URL it keeps scraping just the homepage ('https://www.pastemagazine.com'). I can tell because I expect the print statement to print:
<a class="large-3 medium-3 cell image" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html" aria-label="">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/120/dogcrp-72x72.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg"]'>
<img alt="" />
</picture>
</a>
But instead it prints:
<a aria-label='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"' class="large-3 medium-3 cell image" href="/articles/2019/01/daily-dose-michael-chapman-feat-bridget-st-john-af.html">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-72x72.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg"]'>
<img alt='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"'/>
</picture>
</a>
Which corresponds to an element on the homepage, rather than the specific url I want to scrape from with the search terms. Why does it redirect to the homepage? How can I stop it from doing so?
Upvotes: 0
Views: 5229
Reputation: 2919
If you're sure about the redirection part, you can set allow_redirects to False to stop requests from following the redirect:
r = requests.get(url, allow_redirects=False)
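Before silencing redirects, it's worth confirming one is actually happening. A minimal sketch (check_redirect is a helper name I'm introducing, not part of requests) that disables redirect-following and reports the server's raw answer:

```python
import requests

def check_redirect(url):
    """Fetch url without following redirects and report what the server did.

    Returns (status_code, location): location is the redirect target if the
    server answered with a 3xx status, otherwise None.
    """
    r = requests.get(url, allow_redirects=False)
    if 300 <= r.status_code < 400:
        # The server sent a redirect; the Location header says where to.
        return r.status_code, r.headers.get('Location')
    return r.status_code, None
```

If calling it on the search URL returns a 3xx status with a Location of '/', the server really is bouncing you to the homepage; a 200 would mean the problem lies elsewhere (e.g. in the page content itself).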
Upvotes: 3
Reputation: 22440
To get the URLs of the tweet articles you're after, try the following script. It turns out that sending a User-Agent header along with session cookies resolves the redirect issue.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"
with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    # Deduplicate the hrefs and resolve them against the base URL
    for link in set(urljoin(url, a.get("href")) for a in soup.select("ul.articles a[href*='tweets-of-the-week']")):
        print(link)
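The urljoin call is what turns the root-relative href values from the page into absolute links; for example:

```python
from urllib.parse import urljoin

base = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"
# A root-relative href replaces the path and drops the query string:
print(urljoin(base, "/articles/2018/12/the-funniest-tweets-of-the-week-109.html"))
# → https://www.pastemagazine.com/articles/2018/12/the-funniest-tweets-of-the-week-109.html
```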
Or to make it even easier, upgrade the following libraries:
pip3 install lxml --upgrade
pip3 install beautifulsoup4 --upgrade
And then try:
with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    for item in soup.select("a.noimage[href*='tweets-of-the-week']"):
        print(urljoin(url, item.get("href")))
Upvotes: 0