Reputation: 153
Trying to experiment and learn more about sockets for web-scraping.
I am trying to stream information from a website via WebSockets. I was able to receive data but was wondering what would be the correct approach to read and interpret incoming data from it.
I am using Python 3.7. I was able to set up the connection using an example from https://towardsdatascience.com/scraping-in-another-dimension-7c6890a156da
I am trying to get the stock price data that is displayed on https://finance.yahoo.com/quote/BTC-USD/chart via sockets.
This is the code I am using:
import websocket
import json
from websocket import create_connection
headers = json.dumps({
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language':'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,zh-CN;q=0.6',
'Cache-Control': 'no-cache',
'Connection': 'Upgrade',
'Host': 'streamer.finance.yahoo.com',
'Origin': 'https://finance.yahoo.com',
'Pragma': 'no-cache',
'Sec-WebSocket-Extensions': 'permessage-deflate; client_max_window_bits',
'Sec-WebSocket-Key': 'VW2m4Lw2Rz2nXaWO10kxhw==',
'Sec-WebSocket-Version': '13',
'Upgrade': 'websocket',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
})
ws = create_connection('wss://streamer.finance.yahoo.com/',headers=headers)
ws.send('{"subscribe":["^GSPC","^DJI","^IXIC","^RUT","CL=F","GC=F","SI=F","EURUSD=X","^TNX","^VIX","GBPUSD=X","JPY=X","BTC-USD","^FTSE","^N225"]}')
try:
    while True:
        result = ws.recv()
        print(result)
finally:
    # Close the socket even if the loop is interrupted (e.g. Ctrl+C)
    ws.close()
which allows me to get results like these:
CgReREpJFebCzkYYwJHv8LZbKgNESkkwCTgBRWYd6D5I7tDaigFlAOHuQtgBBA==
CgVKUFk9WBUX2ddCGMCR7/C2WyoDQ0NZMA44AUUVH9w+ZQCM7D7YAQg=
CghFVVJVU0Q9WBVA2Yw/GMCR7/C2WyoDQ0NZMA44AUXuDJI+ZQAgTTvYAQg=
CghHQlBVU0Q9WBUQO58/GMCR7/C2WyoDQ0NZMA44AUXz/fY/ZcDrwDzYAQg=
CgReVklYFYXrkUEYgKOB8LZbKgNXQ0IwCTgBRcRWCcBlwMzMvtgBBA==
CghHQlBVU0Q9WBUVOp8/GJCh7/C2WyoDQ0NZMA44AUWcrfY/ZQCtwDzYAQg=
CgVKUFk9WBUv3ddCGJCh7/C2WyoDQ0NZMA44AUVQ7t8+ZQCk8D7YAQg=
CghFVVJVU0Q9WBU424w/GJCh7/C2WyoDQ0NZMA44AUWi2pQ+ZQAQUTvYAQg=
I am not sure how to interpret the data I am receiving, or how the web browser interprets it. The browser does seem to be receiving the same data that I am, though.
Upvotes: 2
Views: 2317
Reputation: 275
This is indeed protobuf-encoded data. Maxim could compose a protobuf file from it.
I've created a Python package that decodes it. All you need to do is
pip install yliveticker
import yliveticker
# this function is called on each ticker update
def on_new_msg(msg):
    print(msg)

# insert your symbols here
yliveticker.YLiveTicker(on_ticker=on_new_msg, ticker_names=[
    "BTC=X", "^GSPC", "^DJI", "^IXIC", "^RUT", "CL=F", "GC=F", "SI=F", "EURUSD=X", "^TNX", "^VIX", "GBPUSD=X", "JPY=X", "BTC-USD", "^CMC200", "^FTSE", "^N225"])
Feel free to contribute or to use the repository as an example for your project ;)
If you don't see any data, check whether you are within the trading hours of your stock exchange.
Upvotes: 1
Reputation: 2498
My guess is that this is Protobuf-encoded data. Looking at the JavaScript source code for the Yahoo Finance page, you can see that once a ticker has been subscribed to, the replies are handled by a decoding routine.
https://finance.yahoo.com/__finStreamer-worker.js
In the following snippet, there is a clear conversion from the base64 text to bytes and then to a JavaScript object (of type PricingData). Note the mention of protobuf.
QuoteStreamer.prototype.handleWebSocketUpdate = function (event) {
    try {
        var PricingData = protobuf.roots.default.quotefeeder.PricingData;
        var buffer = base64ToArray(event.data); // decode from base64
        var data = PricingData.decode(buffer);  // decode using protobuf
        data = PricingData.toObject(data, {     // convert to a JS object
            enums: String
        });
What you need to figure out next is the Protobuf schema used by Yahoo (which would then allow you to generate a decoder in Python), but I'm not sure it is public. However, you can inspect the actual Protobuf JavaScript code they generated to perform the decoding, and either port it directly to Python or make a guess at the Protobuf schema.
The Javascript decoder is here: https://finance.yahoo.com/__finStreamer-proto.js
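If you want to make a guess at the schema, it may help to look at the raw fields first. The sketch below is only an illustration, not Yahoo's decoder: it base64-decodes one of the messages from the question and walks the generic Protobuf wire format using nothing but the standard library. The function names (read_varint, decode_fields) are made up for this example, and treating 32-bit and 64-bit fixed fields as float and double is an assumption, since the wire format alone cannot tell a float from a fixed-width integer.

```python
import base64
import struct


def read_varint(buf, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear means this was the last byte
            return result, pos
        shift += 7


def decode_fields(b64_message):
    """Return a list of (field_number, wire_type, raw_value) tuples."""
    buf = base64.b64decode(b64_message)
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint (int, bool, enum, zigzag int, ...)
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # 64-bit fixed; assumed here to be a double
            value = struct.unpack_from("<d", buf, pos)[0]
            pos += 8
        elif wire_type == 2:  # length-delimited (string, bytes, sub-message)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        elif wire_type == 5:  # 32-bit fixed; assumed here to be a float
            value = struct.unpack_from("<f", buf, pos)[0]
            pos += 4
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields.append((field_number, wire_type, value))
    return fields


# First message from the question
sample = "CgReREpJFebCzkYYwJHv8LZbKgNESkkwCTgBRWYd6D5I7tDaigFlAOHuQtgBBA=="
for field_number, wire_type, value in decode_fields(sample):
    print(field_number, wire_type, value)
```

For this message, field 1 comes out as the bytes ^DJI and field 2 as a 32-bit value that looks like a price when read as a float, which is consistent with the Protobuf guess. What the remaining field numbers mean would still have to be matched against the generated JavaScript decoder.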
Upvotes: 2