multigoodverse

Reputation: 8072

How to scrape real time streaming data with Python?

I am trying to scrape the number of flights shown on this webpage: https://www.flightradar24.com/56.16,-49.51

The number is highlighted in the picture below: [image]

The number is updated every 8 seconds.

This is what I tried with BeautifulSoup:

import requests
from bs4 import BeautifulSoup
import time

r=requests.get("https://www.flightradar24.com/56.16,-49.51")
c=r.content
soup=BeautifulSoup(c,"html.parser")
value=soup.find_all("span",{"class":"choiceValue"})
print(value)

But that always returns 0:

[<span class="choiceValue" id="menuPlanesValue">0</span>]

View source also shows 0, so I understand why BeautifulSoup returns 0 too.

Anyone know any other method to get the current value?

Upvotes: 6

Views: 30617

Answers (4)

Ruben

Reputation: 38

Note: This is an old question. My answer is for future readers.

If the accepted answer from André is still correct, then the data is not really streamed but rather fetched periodically from the API. I have posted an example of how to scrape data from a live website using Selenium and a WebSocket server.

The steps are:

  1. Open the website with Selenium (or similar software).

  2. Run a WebSocket server to collect and process the data.

  3. Inject JavaScript into the website that:

    i) Connects to the WebSocket server.

    ii) Observes the data changes (using MutationObserver) and sends them to the WebSocket.

  4. Process the data in your application, such as aggregating and storing it.

You can find the minimal example here: https://github.com/rbnbr/LiveWebsiteScraper/blob/main/minimal_example.py

This method avoids issuing many GET requests of your own: it reuses the connections the page already has open, so the only requests made are the ones the website sends itself.
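Steps 3 and 4 could be sketched roughly as below. This is not the code from the linked repository, just a minimal illustration: the WebSocket address `ws://localhost:8765`, the helper names `build_observer_script` and `handle_message`, and the message format are all assumptions made for the example. The selector `#menuPlanesValue` is the element id from the question.

```python
import json

# JavaScript to inject into the page. The placeholders are filled in with
# JSON-encoded strings so that quotes are escaped correctly.
OBSERVER_JS_TEMPLATE = """
const socket = new WebSocket(%(ws_url)s);
const target = document.querySelector(%(selector)s);
const observer = new MutationObserver(() => {
    if (socket.readyState === WebSocket.OPEN) {
        socket.send(JSON.stringify({value: target.textContent}));
    }
});
observer.observe(target, {childList: true, characterData: true, subtree: true});
"""


def build_observer_script(ws_url, selector):
    """Return the JavaScript that forwards changes of `selector` to `ws_url`."""
    return OBSERVER_JS_TEMPLATE % {
        "ws_url": json.dumps(ws_url),
        "selector": json.dumps(selector),
    }


def handle_message(raw, history):
    """Store one message from the page, skipping consecutive duplicates."""
    value = json.loads(raw)["value"].strip()
    if not history or history[-1] != value:
        history.append(value)


# Hypothetical address of the WebSocket server from step 2
script = build_observer_script("ws://localhost:8765", "#menuPlanesValue")
print(script)
```

With Selenium, the generated script would be passed to `driver.execute_script(script)` once the page has loaded, and `handle_message` would be called by whatever WebSocket server you run for each message it receives.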

Upvotes: 0

Reza

Reputation: 141

You can use Selenium to scrape a webpage whose content is added dynamically by JavaScript.

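from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# PhantomJS support was removed in Selenium 4, so headless Chrome is used here
options = webdriver.ChromeOptions()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)
browser.get('https://www.flightradar24.com/56.16,-49.51/3')

# Wait (up to 30 s) until JavaScript has replaced the initial 0 with a live value
WebDriverWait(browser, 30).until(
    lambda d: d.find_element(By.ID, "menuPlanesValue").text.strip() not in ("", "0")
)

soup = BeautifulSoup(browser.page_source, "html.parser")
result = soup.find_all("span", {"id": "menuPlanesValue"})

for item in result:
    print(item.text)

browser.quit()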

Upvotes: 1

linusg

Reputation: 6429

So based on what @André has found out, I wrote this code:

import time

import requests

def get_count():
    url = "https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1"

    # Send a browser-like User-Agent header, otherwise you will get a 403 HTTP error
    r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    r.raise_for_status()

    # Parse the JSON
    data = r.json()

    # Sum the per-source counts to get the total number of flights
    return sum(data["stats"]["total"].values())

while True:
    print(get_count())
    time.sleep(8)

The code should be self-explanatory: all it does is print the current flight count every 8 seconds :)

Note: The values are similar to the ones on the website, but not identical, because the Python script and the website are unlikely to send their requests at the same moment. If you want more accurate results, make a request more often, for example every 4 seconds.
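If you do poll more often, many consecutive readings will be identical. Here is a small sketch of a polling helper that only reports a value when it changes. The name `poll_changes` and its parameters are made up for this example; `fetch` can be any function that returns the current count, and `sleep` is injectable so the loop can be tried out without waiting.

```python
import time


def poll_changes(fetch, interval, max_polls, sleep=time.sleep):
    """Call fetch() up to max_polls times and yield a value only when it changes."""
    last = None
    for i in range(max_polls):
        value = fetch()
        if value != last:
            yield value
            last = value
        # No need to wait after the final poll
        if i < max_polls - 1:
            sleep(interval)


# Offline demonstration with canned readings instead of real requests
readings = iter([100, 100, 101, 101, 102])
for count in poll_changes(lambda: next(readings), interval=4, max_polls=5,
                          sleep=lambda seconds: None):
    print(count)
```

To use it against the live feed you would pass a function such as `get_count` as `fetch` and leave `sleep` at its default.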

Use this code as you want, extend it or whatever. Hope this helps!

Upvotes: 7

André Laszlo

Reputation: 15537

The problem with your approach is that the page first loads an empty view and then performs regular background requests to refresh the data. A plain GET of the page therefore only ever sees the initial value of 0. If you look at the network tab in the developer console in Chrome (for example), you'll see the requests to https://data-live.flightradar24.com/zones/fcgi/feed.js?bounds=59.09,52.64,-58.77,-47.71&faa=1&mlat=1&flarm=1&adsb=1&gnd=1&air=1&vehicles=1&estimated=1&maxage=7200&gliders=1&stats=1
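If you want to query a different area, the URL can be assembled from its query parameters rather than edited by hand. This is only a sketch: the parameter names and values are copied from the request above, their exact meaning is not documented here, and `build_feed_url` is a name I made up.

```python
from urllib.parse import urlencode

FEED_ENDPOINT = "https://data-live.flightradar24.com/zones/fcgi/feed.js"


def build_feed_url(bounds):
    """Return the feed URL for a bounds string such as '59.09,52.64,-58.77,-47.71'."""
    params = {
        "bounds": bounds,
        "faa": 1, "mlat": 1, "flarm": 1, "adsb": 1, "gnd": 1, "air": 1,
        "vehicles": 1, "estimated": 1, "maxage": 7200, "gliders": 1, "stats": 1,
    }
    # safe="," keeps the commas inside the bounds value unescaped
    return FEED_ENDPOINT + "?" + urlencode(params, safe=",")


print(build_feed_url("59.09,52.64,-58.77,-47.71"))
```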

The response is regular JSON:

{
  "full_count": 11879,
  "version": 4,
  "afefdca": [
    "A86AB5",
    56.4288,
    -56.0721,
    233,
    38000,
    420,
    "0000",
    "T-F5M",
    "B763",
    "N641UA",
    1473852497,
    "LHR",
    "ORD",
    "UA929",
    0,
    0,
    "UAL929",
    0
  ],
  ...
  "aff19d9": [
    "A12F78",
    56.3235,
    -49.3597,
    251,
    36000,
    436,
    "0000",
    "F-EST",
    "B752",
    "N176AA",
    1473852497,
    "DUB",
    "JFK",
    "AA291",
    0,
    0,
    "AAL291",
    0
  ],
  "stats": {
    "total": {
      "ads-b": 8521,
      "mlat": 2045,
      "faa": 598,
      "flarm": 152,
      "estimated": 464
    },
    "visible": {
      "ads-b": 0,
      "mlat": 0,
      "faa": 6,
      "flarm": 0,
      "estimated": 3
    }
  }
}

I'm not sure if this API is protected in any way, but it seems like I can access it without any issues using curl.
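Once you have the response, pulling numbers out of it is straightforward. Below is a minimal sketch that works on a trimmed copy of the sample above (one aircraft entry instead of all of them). It assumes that every top-level key whose value is a list is one aircraft record, which is my reading of the sample and not something I have seen documented; `summarize_feed` is just a name for this example.

```python
import json


def summarize_feed(payload):
    """Summarize a parsed feed response."""
    # Assumption: each list-valued top-level entry describes one aircraft
    aircraft = {key: value for key, value in payload.items()
                if isinstance(value, list)}
    totals = payload.get("stats", {}).get("total", {})
    return {
        "aircraft_in_response": len(aircraft),
        "total_by_source": sum(totals.values()),
        "full_count": payload.get("full_count"),
    }


# Trimmed copy of the sample response shown above
sample = json.loads("""
{
  "full_count": 11879,
  "version": 4,
  "afefdca": ["A86AB5", 56.4288, -56.0721, 233, 38000, 420, "0000", "T-F5M",
              "B763", "N641UA", 1473852497, "LHR", "ORD", "UA929", 0, 0,
              "UAL929", 0],
  "stats": {
    "total": {"ads-b": 8521, "mlat": 2045, "faa": 598, "flarm": 152,
              "estimated": 464},
    "visible": {"ads-b": 0, "mlat": 0, "faa": 6, "flarm": 0, "estimated": 3}
  }
}
""")

print(summarize_feed(sample))
```

Note that in this sample the sum of `stats.total` (11780) is not the same number as `full_count` (11879), so check which of the two matches the value shown on the page before relying on either.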


Upvotes: 8
