Reputation: 1
I'm trying to scrape the links found in Google RSS feeds for a given country.
The links are in the XML returned when you visit this URL: https://news.google.com/rss/search?q={"Example_Country"}
I am able to parse the links, but when I fetch them with requests they return JavaScript rather than the article page you reach when you click them in a browser.
What are these links that Google uses in the RSS feed, and what is the result of passing them to requests.get? Ultimately, I want to know the best way to get the actual links so I can scrape them.
So far I am able to parse the XML returned by https://news.google.com/rss/search?q={"Example_Country"}.
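For context, the parsing step might look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree on an inline sample rather than a live request; the sample XML, its article IDs, and the helper name extract_links are made up for illustration and assume the feed follows the usual RSS 2.0 item/link layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of an RSS 2.0 feed. A real feed would be the
# response body fetched from the search URL above.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example</title>
    <link>https://news.google.com/</link>
    <item>
      <title>First story</title>
      <link>https://news.google.com/rss/articles/AAA?oc=5</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://news.google.com/rss/articles/BBB?oc=5</link>
    </item>
  </channel>
</rss>"""


def extract_links(feed_xml):
    """Return the <link> text of every <item> in an RSS document (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    # Only look inside <item> elements so the channel-level <link> is skipped.
    return [item.findtext("link") for item in root.iter("item")]


links = extract_links(SAMPLE_FEED)
print(links)
```

Each entry in links is a news.google.com/rss/articles/... URL, not the publisher's URL, which is where the problem starts.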
But when I try the following approach:
`import requests

# url is one of the links parsed from the RSS feed
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get(url, headers=headers, allow_redirects=True)
response.encoding = 'utf-8'`
I have no idea what is returned. I expected to be redirected to the actual URL.
Upvotes: 0
Views: 58
Reputation: 1906
Google News uses JavaScript to redirect to the article. You can get the article's URL by replicating the POST request:
import requests
import json
from bs4 import BeautifulSoup
google_rss_url = 'https://news.google.com/rss/articles/CBMiogFBVV95cUxQLUFubVFtajlobFdKQllBb1JBRDNPNm8yRE51a0N2STl3SGd6eFY2cEVMdllnM1VwNGRBbWxsa0FLUGViaHNaN2p2R2oyMWFsNklvM3pFdXZFWjNNT1dNN2lrclRzWTlLOEpOLXYzZzdETnFMS0ZUZktvZTREcmU0ZzZucmFYZk0tR1gzTVlEUWh4dXVpMGMxSWltb2x2enl1TEE?oc=5'
# Fetch the Google News page that the RSS link points to
resp = requests.get(google_rss_url)
# The request parameters are stored in the data-p attribute of the c-wiz element
data = BeautifulSoup(resp.text, 'html.parser').select_one('c-wiz[data-p]').get('data-p')
obj = json.loads(data.replace('%.@.', '["garturlreq",'))
# Build the form body for the batchexecute call
payload = {
    'f.req': json.dumps([[['Fbv4je', json.dumps(obj[:-6] + obj[-2:]), 'null', 'generic']]])
}
headers = {
    'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}
url = "https://news.google.com/_/DotsSplashUi/data/batchexecute"
response = requests.post(url, headers=headers, data=payload)
# Strip the )]}' prefix, parse the outer JSON, then decode the nested JSON string
array_string = json.loads(response.text.replace(")]}'", ""))[0][2]
article_url = json.loads(array_string)[1]
print(article_url)
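If you need to do this for many feed links, it may help to pull the two JSON-handling steps into small functions so they can be checked without hitting the network. The following is only a sketch of the same logic as above; the names build_payload and parse_article_url are hypothetical, and the sample data-p string and response body are made-up stand-ins, not real Google output.

```python
import json


def build_payload(data_p):
    """Build the batchexecute form body from a c-wiz data-p attribute value."""
    obj = json.loads(data_p.replace('%.@.', '["garturlreq",'))
    # Same slicing as above: keep everything except four entries near the end.
    inner = json.dumps(obj[:-6] + obj[-2:])
    return {'f.req': json.dumps([[['Fbv4je', inner, 'null', 'generic']]])}


def parse_article_url(response_text):
    """Extract the article URL from a batchexecute response body."""
    outer = json.loads(response_text.replace(")]}'", ""))
    # The third field of the first entry is itself a JSON-encoded array.
    return json.loads(outer[0][2])[1]


# Made-up inputs, only to show the shape of the data each function expects.
sample_data_p = '%.@.1,2,3,4,5,6,7,8]'
sample_response = ")]}'\n\n" + json.dumps(
    [["wrb.fr", "Fbv4je", json.dumps(["garturlres", "https://example.com/story"])]]
)

payload = build_payload(sample_data_p)
article_url = parse_article_url(sample_response)
print(article_url)
```

With these in place, the network part reduces to one requests.get to read data-p and one requests.post with build_payload(...) as the form data. Since this relies on an undocumented endpoint, the slicing and indexes may need adjusting if Google changes the page.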
Upvotes: 1