Reputation: 574
I saved a search in https://news.google.com/ but google does not use the actual links found on its results page. Rather, you will find links like this:
https://news.google.com/articles/CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA?hl=en-US&gl=US&ceid=US%3Aen
I want the 'real link' that this resolves to using python. If you plug the above url into your browser, for a split second you will see
Opening https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm
I tried a few things using the Requests module but 'no cigar'.
If it can't be done, are these google links permanent - can they always be used to open up the web page?
UPDATE 1:
After posting this question I used a hack to solve the problem. I simply used urllib again to open up the google url and then parsed the source to find the 'real url'.
It was exciting to see TDG's answer as it would help my program to run faster. But google is being cryptic and it did not work for ever link.
For this mornings news feed, it bombed on the 4th news item:
RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py
cp1252
cp1252
>>> 1
Tommy Angelo Presents: The Butoff
CBMiTWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvdG9tbXktYW5nZWxvLXByZXNlbnRzLXRoZS1idXRvZmYtMzE4ODEuaHRt0gEA
b'\x08\x13"Mhttps://www.pokernews.com/strategy/tommy-angelo-presents-the-butoff-31881.htm\xd2\x01\x00'
Flopped Set of Nines: Get All In on Flop or Wait?
CBMiXGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvZmxvcHBlZC1zZXQtb2YtbmluZXMtZ2V0LWFsbC1pbi1vbi1mbG9wLW9yLXdhaXQtMzE4ODAuaHRt0gEA
b'\x08\x13"\\https://www.pokernews.com/strategy/flopped-set-of-nines-get-all-in-on-flop-or-wait-31880.htm\xd2\x01\x00'
What Not to Do Online: Don’t Just Stop Thinking and Shove
CBMiZWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd2hhdC1ub3QtdG8tZG8tb25saW5lLWRvbi10LWp1c3Qtc3RvcC10aGlua2luZy1hbmQtc2hvdmUtMzE4NzAuaHRt0gEA
b'\x08\x13"ehttps://www.pokernews.com/strategy/what-not-to-do-online-don-t-just-stop-thinking-and-shove-31870.htm\xd2\x01\x00'
Hold’em with Holloway, Vol. 77: Joseph Cheong Gets Crazy with a Pair of Ladies
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
Traceback (most recent call last):
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 68, in <module>
GetGoogleNews("https://news.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.pokernews.com%2Fstrategy&hl=en-US&gl=US&ceid=US%3Aen", 'news')
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 34, in GetGoogleNews
real_URL = base64.b64decode(coded)
File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\base64.py", line 87, in b64decode
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
>>>
UPDATE 2:
After reading up on base64 I think the 'Incorrect padding' padding message means that the input string must be divisible by 4. So I added 'aa' to
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
and did not get the error message:
>>> t = s + 'aa'
>>> len(t)/4
32.0
>>> base64.b64decode(t)
b'\x08\x13"Whttps://www.pokernews.com/strategy/hold-em-with-holloway-vol-77-joseph-cheong-31858.htm\xd2\x01\x00\x06\x9a'
Upvotes: 6
Views: 5100
Reputation: 333
In 2024, I tried using base64 decode, but it didn't work. So, I found a solution in TypeScript and converted it to Python.
from googlenewsdecoder import gnewsdecoder
def main():
interval_time = 1 # interval is optional, default is None
source_url = "https://news.google.com/read/CBMi2AFBVV95cUxPd1ZCc1loODVVNHpnbFFTVHFkTG94eWh1NWhTeE9yT1RyNTRXMVV2S1VIUFM3ZlVkVjl6UHh3RkJ0bXdaTVRlcHBjMWFWTkhvZWVuM3pBMEtEdlllRDBveGdIUm9GUnJ4ajd1YWR5cWs3VFA5V2dsZnY1RDZhVDdORHRSSE9EalF2TndWdlh4bkJOWU5UMTdIV2RCc285Q2p3MFA4WnpodUNqN1RNREMwa3d5T2ZHS0JlX0MySGZLc01kWDNtUEkzemtkbWhTZXdQTmdfU1JJaXY?hl=en-US&gl=US&ceid=US%3Aen"
try:
decoded_url = gnewsdecoder(source_url, interval=interval_time)
if decoded_url.get("status"):
print("Decoded URL:", decoded_url["decoded_url"])
else:
print("Error:", decoded_url["message"])
except Exception as e:
print(f"Error occurred: {e}")
if __name__ == "__main__":
main()
Upvotes: 4
Reputation: 1
as pointed in this stackoverflow response link, just adding '==' at the end of the string will do the trick
Upvotes: 0
Reputation: 63504
I use the following code which you can put in a new module, e.g. gnews.py
. This answer is applicable to the RSS feeds provided by Google News, and may otherwise need a slight adjustment. Note that I cache the returned value.
Steps used:
"""Decode encoded Google News entry URLs."""
import base64
import functools
import re
# Ref: https://stackoverflow.com/a/59023463/
_ENCODED_URL_PREFIX = "https://news.google.com/__i/rss/rd/articles/"
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')
@functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
match = _ENCODED_URL_RE.match(url)
encoded_text = match.groupdict()["encoded_url"] # type: ignore
encoded_text += "===" # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
decoded_text = base64.urlsafe_b64decode(encoded_text)
match = _DECODED_URL_RE.match(decoded_text)
primary_url = match.groupdict()["primary_url"] # type: ignore
primary_url = primary_url.decode()
return primary_url
def decode_google_news_url(url: str) -> str: # Not cached because not all Google News URLs are encoded.
"""Return Google News entry URLs after decoding their encoding as applicable."""
return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIX) else url
Usage example:
>>> decode_google_news_url('https://news.google.com/__i/rss/rd/articles/CBMiQmh0dHBzOi8vd3d3LmV1cmVrYWxlcnQub3JnL3B1Yl9yZWxlYXNlcy8yMDE5LTExL2RwcGwtYmJwMTExODE5LnBocNIBAA?oc=5')
'https://www.eurekalert.org/pub_releases/2019-11/dppl-bbp111819.php'
Upvotes: 3
Reputation: 6161
Basically it is base64 coded string. If you run the following code snippet:
import base64
coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
url = base64.b64decode(coded)
print(url)
You'll get the following output:
b'\x08\x13"Phttps://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm\xd2\x01\x00'
So it looks like your url with some extras. If all the extras are the same, it will be easy to filter out the url. If not - you'll have to handle every one separately.
Upvotes: 3