Andrey Kite Gorin

Reputation: 1040

Capturing the video stream from a website into a file

For my image classification project I need to collect classified images, and a good source for me would be the various webcams around the world that stream video over the internet, like this one:

https://www.skylinewebcams.com/en/webcam/espana/comunidad-valenciana/alicante/benidorm-playa-poniente.html

I don't really have any experience with video streaming or web scraping in general, so after searching the internet for information, I came up with this naive Python code:

import requests

# URL copied from the src attribute of the page's <video> element
url = 'https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99'
r1 = requests.get(url, stream=True)
filename = "stream.avi"

if r1.status_code == 200:
    with open(filename, 'w') as f:
        for chunk in r1.iter_content(chunk_size=1024):
            f.write(chunk)
else:
    print("Received unexpected status code {}".format(r1.status_code))

where the URL was taken from the source of the video block on the website:

<video data-html5-video=""
poster="//static.skylinewebcams.com/_2933625150.jpg" preload="metadata"
src="blob:https://www.skylinewebcams.com/a816de08-9805-4cc2-94e6-2daa3495eb99"></video>

but it does not work (the avi file is empty), even though the video streams fine in the browser. Can anybody explain how to capture this video stream into a file?

Upvotes: 7

Views: 14978

Answers (2)

OlleMort21

Reputation: 21

The list turns out empty because you're making an HTTP request without headers (a sure sign that it was made programmatically), and many sites simply respond to such requests with a 403.

You should use a library like Requests or pycurl to add headers to your requests, and they should then work fine. For an example request (complete with headers), you can open your web browser's developer console while the stream is playing, find an HTTP request for the m3u8 URL, right-click on it, and choose "copy as cURL". Note that there may be site-specific, arbitrary headers that have to be sent with each request.
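As a rough sketch of the "copy as cURL" step, the headers can be pulled out of the copied command and reused from Python. The helper name `headers_from_curl` and the example command are made up for illustration; the sketch assumes the bash variant of "copy as cURL" and only looks at `-H`/`--header` flags.

```python
import shlex


def headers_from_curl(curl_command):
    """Extract the -H/--header values of a "copy as cURL" command into a dict."""
    # Join shell line continuations first so shlex sees one logical line.
    tokens = shlex.split(curl_command.replace("\\\n", " "))
    headers = {}
    # Walk over adjacent token pairs: a header flag is followed by "Name: value".
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header") and ":" in value:
            name, _, content = value.partition(":")
            headers[name.strip()] = content.strip()
    return headers


# Hypothetical command, shaped like the output of "Copy as cURL (bash)".
example = (
    "curl 'https://example.com/live.m3u8' \\\n"
    "  -H 'Referer: https://example.com/' \\\n"
    "  -H 'User-Agent: Mozilla/5.0' --compressed"
)
headers = headers_from_curl(example)
print(headers)
```

The resulting dict can then be passed along with the request, e.g. `requests.get(url, headers=headers, stream=True)`. Which headers a given site actually checks is something you have to find out by trial.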

If you want to scrape multiple sites with different headers, and/or want to future-proof your code for if they change the headers, addresses or formats, then you probably need something more advanced. Worst-case scenario, you might need to run a headless browser to open the site with WebDriver/Selenium and capture the requests it makes to generate your requests.

Keep in mind that you should read each site's ToS, or you might end up doing something illegal. Scraping in violation of the ToS is basically digital trespassing, and I think at least craigslist has already won lawsuits on that basis.

Upvotes: 2

Andrey Kite Gorin

Reputation: 1040

I've made some progress since then. Here is the code:

import requests

print("Recording video...")
url = 'https://hddn01.skylinewebcams.com/02930601ENXS-1523680721427.ts'
r1 = requests.get(url, stream=True)
filename = "stream.avi"

num = 0
if r1.status_code == 200:
    with open(filename, 'wb') as f:  # 'wb': the chunks are bytes
        for chunk in r1.iter_content(chunk_size=1024):
            num += 1
            f.write(chunk)
            if num > 5000:  # stop after about 5000 chunks of up to 1 KB each
                print('end')
                break
else:
    print("Received unexpected status code {}".format(r1.status_code))

Now I can get a piece of the video written to the file. What I've changed: 1) in open(filename,'wb') I changed 'w' to 'wb' to write binary data, and, most importantly, 2) I changed the URL. I looked in the Chrome DevTools 'Network' tab to see which requests the browser sends to get the live stream, and just copied the most recent one, which requests a .ts file.
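The loop above limits the download by counting chunks. A possible variation, sketched here with a made-up helper name `save_chunks`, counts the bytes actually written instead, since a chunk can be shorter than `chunk_size`. It works on any iterable of bytes, so it can be tried without a network connection.

```python
import io


def save_chunks(chunks, out, max_bytes):
    """Write binary chunks to `out`, stopping once `max_bytes` have been written."""
    written = 0
    for chunk in chunks:
        if not chunk:  # skip empty chunks
            continue
        out.write(chunk)
        written += len(chunk)
        if written >= max_bytes:
            break
    return written


# Offline demonstration with an in-memory buffer instead of a real file.
buf = io.BytesIO()
written = save_chunks([b"a" * 1024] * 10, buf, max_bytes=4096)
print(written)
```

With a real response this would be called as `save_chunks(r1.iter_content(chunk_size=1024), f, max_bytes)`, where `f` is the file opened with 'wb'.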

Next, I found out how to get the addresses of the .ts video files. One can use the m3u8 module (installable with pip) like this:

import m3u8

m3u8_obj = m3u8.load('https://hddn01.skylinewebcams.com/live.m3u8?a=k2makj8nd279g717kt4d145pd3')
playlist = [el['uri'] for el in m3u8_obj.data['segments']]

The playlist of video files will then look something like this:

['https://hddn04.skylinewebcams.com/02930601ENXS-1523720836405.ts',
 'https://hddn04.skylinewebcams.com/02930601ENXS-1523720844347.ts',
 'https://hddn04.skylinewebcams.com/02930601ENXS-1523720852324.ts',
 'https://hddn04.skylinewebcams.com/02930601ENXS-1523720860239.ts',
 'https://hddn04.skylinewebcams.com/02930601ENXS-1523720868277.ts',
 'https://hddn04.skylinewebcams.com/02930601ENXS-1523720876252.ts']

and I can download each of the video files from the list.
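Since the playlist of a live stream only lists the last few segments, recording for longer means reloading it repeatedly, and consecutive loads will overlap. A minimal sketch of the bookkeeping, with a hypothetical helper `new_segments` and shortened example URLs:

```python
def new_segments(playlist, seen):
    """Return the URIs in `playlist` that are not in `seen` yet, in order,
    and record them in `seen` so they are not downloaded twice."""
    fresh = [uri for uri in playlist if uri not in seen]
    seen.update(fresh)
    return fresh


seen = set()
# Two consecutive (made-up) playlist loads that overlap by two segments.
first_load = ["https://example.com/a-1.ts", "https://example.com/a-2.ts",
              "https://example.com/a-3.ts"]
second_load = ["https://example.com/a-2.ts", "https://example.com/a-3.ts",
               "https://example.com/a-4.ts"]

first_batch = new_segments(first_load, seen)
second_batch = new_segments(second_load, seen)
print(first_batch)
print(second_batch)
```

Each returned URI can then be fetched with requests and appended to one output file opened with 'ab'. I'd expect .ts segments appended back to back to play as one video, but I have not verified that for this particular stream.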

The only problem left is that in order to load the playlist I first need to open the webpage in a browser; otherwise the playlist will be empty. Probably opening the webpage initiates the streaming, which creates the m3u8 file on the server so that it can be requested. I still don't know how to initiate streaming from Python without opening the page in the browser.
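One thing that might be worth trying, though nothing here confirms it works for this site, is to download the page's HTML from Python and search it for a playlist address. The sketch below only shows the search step. It assumes the m3u8 URL appears as a plain absolute URL in the page source, and the helper name and sample HTML are invented for the example.

```python
import re

# Absolute http(s) URLs that contain ".m3u8", up to the next quote, bracket or space.
M3U8_RE = re.compile(r"""https?://[^\s'"<>]+\.m3u8[^\s'"<>]*""")


def find_playlist_urls(html):
    """Return every URL containing '.m3u8' found in `html`, in order, without duplicates."""
    return list(dict.fromkeys(M3U8_RE.findall(html)))


# Invented page fragment; a real page may embed the URL differently.
sample_html = """
<script>
  var player = {source: "https://example.com/live.m3u8?a=token123", autoplay: true};
</script>
"""
urls = find_playlist_urls(sample_html)
print(urls)
```

If the page builds the URL in JavaScript or escapes the slashes, this search finds nothing, and a headless browser is probably the fallback.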

Upvotes: 6
