Vincent

Reputation: 852

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?

I have a search engine that returns a list of different URLs.

I want to get only the HTML tag that contains a video embed URL, and extract the link from it.

Example:

import BeautifulSoup

html = '''https://archive.org/details/20070519_detroit2'''
    #or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
    #or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''

soup = BeautifulSoup.BeautifulSoup(html)

What should I do next to get the video's HTML tag (video or object), or the exact link to the video?

I need the link for my iframe. I will integrate the Python script with my PHP code: Python fetches and outputs the video link, and PHP then echoes it into the iframe.

Upvotes: 1

Views: 12366

Answers (3)

Serial

Reputation: 8043

You need to get the HTML of the page, not just the URL.

Use the built-in urllib library like this:

import urllib
from bs4 import BeautifulSoup as BS

url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")

Also, for the site you are using, I noticed that you can get the embed link just by changing details in the link to embed, so it looks like this:

https://archive.org/embed/20070519_detroit2

So if you want to do this for multiple URLs without having to parse anything, do something like this:

url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
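The same rewrite can be sketched for Python 3 with urllib.parse, which avoids relying on the position of 'details' in a plain split on '/'. The function name details_to_embed is made up for illustration, and the sketch assumes the path has the form /details/<identifier>; other URLs are returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def details_to_embed(url):
    """Swap a leading 'details' path segment for 'embed' (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # For "/details/<identifier>", segments[0] is "" and segments[1] is "details".
    if len(segments) > 2 and segments[1] == "details":
        segments[1] = "embed"
    return urlunsplit(parts._replace(path="/".join(segments)))


print(details_to_embed("https://archive.org/details/20070519_detroit2"))
# prints https://archive.org/embed/20070519_detroit2
```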

EDIT

To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it is in, and then the attribute that holds it.

For

'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''

just do

soup.find("iframe").get("src")

Use iframe because the link is in the iframe tag, and .get("src") because the link is the src attribute.
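Since different sites put the link in different tags, one way to generalize this is to try several tag names in turn. This is only a sketch for Python 3 with bs4: the helper name find_embed_src, the list of tag names, and the sample HTML are assumptions for illustration, and real pages may need site-specific handling.

```python
from bs4 import BeautifulSoup


def find_embed_src(html, tag_names=("video", "source", "iframe", "embed")):
    """Return the first non-empty src found on the given tags, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for name in tag_names:
        for tag in soup.find_all(name):
            src = tag.get("src")  # .get() returns None when the attribute is missing
            if src:
                return src
    return None


# Made-up sample markup, not taken from any of the sites above.
sample = '<div><iframe src="http://example.com/player/123"></iframe></div>'
print(find_embed_src(sample))
```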

You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)

Good luck!

Upvotes: 7

B.Mr.W.

Reputation: 19648

This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.

import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
# tag.get('href', '') avoids a KeyError on <a> tags that have no href attribute
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links

Here is the output:

['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4', 
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']

These are relative links; put them together with the domain to get the absolute URLs.
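Joining the relative links with the page URL can be sketched with urljoin from the Python 3 standard library. The variable names are illustrative, and the two paths are copied from the output above.

```python
from urllib.parse import urljoin

page_url = "https://archive.org/details/20070519_detroit2"
relative_links = [
    "/download/20070519_detroit2/20070519_detroit_jungleearth.mp4",
    "/download/20070519_detroit2/20070519_detroit_goodman.mp4",
]

# A link starting with "/" replaces the whole path of the page URL.
absolute_links = [urljoin(page_url, link) for link in relative_links]
for link in absolute_links:
    print(link)
```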

Upvotes: 1

aIKid

Reputation: 28342

You can't parse a URL. BeautifulSoup is used to parse an html page. Retrieve the page first:

import urllib2
from bs4 import BeautifulSoup

data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")

html = data.read()

Then you can use find, and then take the src attribute:

soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
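Note that soup.find returns None when the page has no video tag, and video['src'] raises a KeyError when the tag has no src attribute. A more defensive version for Python 3 might look like the sketch below; fetch_html and video_src are hypothetical helper names, and whether a given page actually exposes a video tag with a src attribute is not guaranteed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download the raw page bytes (Python 3 counterpart of urllib2.urlopen)."""
    with urlopen(url) as response:
        return response.read()


def video_src(html):
    """Return the src of the first <video> tag, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    video = soup.find("video")
    if video is None:
        return None
    return video.get("src")  # None when the tag has no src attribute
```

It could then be called as video_src(fetch_html("https://archive.org/details/20070519_detroit2")), checking for None before using the result.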

Upvotes: 1
