Jack Holly

Reputation: 1

How to scrape Google News RSS feed links?

I'm trying to scrape the links found in Google News RSS feeds for a given country.

The links are in the XML returned when you visit this URL: https://news.google.com/rss/search?q={"Example_Country"}

I am able to parse the links, but when I fetch them with requests they return JavaScript rather than the actual article pages you reach when you click them in a browser.

What are these links that Google uses in the XML RSS feed, and what is the result of passing them to requests.get? Ultimately, I want to know the best way to get the actual links so I can scrape them.

So far I am able to parse the XML from https://news.google.com/rss/search?q={"Example_Country"}.
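For reference, the parsing step looks roughly like the sketch below. It assumes the feed is plain RSS 2.0 with one `<link>` per `<item>`; the function name `extract_item_links` is only illustrative, and it takes the already downloaded XML so it does not depend on how the feed is fetched.

```python
import xml.etree.ElementTree as ET


def extract_item_links(rss_xml):
    """Return the <link> text of every <item> in an RSS document.

    rss_xml may be str or bytes (for example response.content from requests).
    """
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter('item'):
        # findtext returns None when the item has no <link> child
        link = item.findtext('link')
        if link and link.strip():
            links.append(link.strip())
    return links
```

Each returned value is a news.google.com/rss/articles/... URL, not the publisher's URL, which is where the problem starts.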

But when I try the following approach:

`import requests

# url is one of the <link> values parsed from the RSS feed
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

response = requests.get(url, headers=headers, allow_redirects=True)
response.encoding = 'utf-8'`

I have no idea what is returned. I expected to be redirected to the actual URL.

Upvotes: 0

Views: 58

Answers (1)

GTK

Reputation: 1906

Google News uses JavaScript to redirect to the article. You can get the article's URL by replicating the POST request:

import requests
import json
from bs4 import BeautifulSoup

google_rss_url = 'https://news.google.com/rss/articles/CBMiogFBVV95cUxQLUFubVFtajlobFdKQllBb1JBRDNPNm8yRE51a0N2STl3SGd6eFY2cEVMdllnM1VwNGRBbWxsa0FLUGViaHNaN2p2R2oyMWFsNklvM3pFdXZFWjNNT1dNN2lrclRzWTlLOEpOLXYzZzdETnFMS0ZUZktvZTREcmU0ZzZucmFYZk0tR1gzTVlEUWh4dXVpMGMxSWltb2x2enl1TEE?oc=5'

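# Fetch the Google News article page and read the request parameters
# stored in the data-p attribute of the c-wiz element
resp = requests.get(google_rss_url)
data = BeautifulSoup(resp.text, 'html.parser').select_one('c-wiz[data-p]').get('data-p')
# Swap the '%.@.' prefix for a JSON array opening so the string can be parsed
obj = json.loads(data.replace('%.@.', '["garturlreq",'))

# Build the form payload for the 'Fbv4je' call, leaving out obj[-6:-2]
payload = {
    'f.req': json.dumps([[['Fbv4je', json.dumps(obj[:-6] + obj[-2:]), 'null', 'generic']]])
}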

headers = {
  'content-type': 'application/x-www-form-urlencoded;charset=UTF-8',
  'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
}

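url = "https://news.google.com/_/DotsSplashUi/data/batchexecute"
response = requests.post(url, headers=headers, data=payload)
# Strip the ")]}'" prefix, then take the nested JSON string from the first entry
array_string = json.loads(response.text.replace(")]}'", ""))[0][2]
# The article URL is the second element of the decoded array
article_url = json.loads(array_string)[1]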

print(article_url)
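If you need to resolve many feed links, it may help to pull the two JSON-handling steps out into small helpers so they can be checked without hitting the network. This is only a sketch: the names `build_payload` and `parse_article_url` are mine, the logic is copied from the snippet above, and since this does not appear to be a documented API, the page structure and response shape may change.

```python
import json


def build_payload(data_p):
    """Turn the c-wiz data-p attribute value into the form payload."""
    # Swap the '%.@.' prefix for a JSON array opening, as in the snippet above
    obj = json.loads(data_p.replace('%.@.', '["garturlreq",'))
    # Keep everything except obj[-6:-2]
    inner = json.dumps(obj[:-6] + obj[-2:])
    return {'f.req': json.dumps([[['Fbv4je', inner, 'null', 'generic']]])}


def parse_article_url(response_text):
    """Extract the article URL from the batchexecute response body."""
    # Strip the ")]}'" prefix before decoding the outer JSON
    outer = json.loads(response_text.replace(")]}'", ""))
    # The first entry holds a JSON-encoded array; its second element is the URL
    return json.loads(outer[0][2])[1]
```

With these, the code above becomes `payload = build_payload(data)` and `article_url = parse_article_url(response.text)`.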

Upvotes: 1
