nlper

Reputation: 2397

Getting youtube link element from source code

I am looking at http://www.bing.com/videos/search?q=kohli and trying to extract the video URLs.

Each anchor tag contains a YouTube link, but the link sits inside a dictionary-like string in the tag's vrhm attribute, and that link is what I want to extract.

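import urllib
import urllib2
from bs4 import BeautifulSoup  # assumed; with BeautifulSoup 3 use: from BeautifulSoup import BeautifulSoup

redditFile = urllib2.urlopen("http://www.bing.com/videos?q="+urllib.quote_plus(word))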
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['vrhm'] #This element contains youtube urls but print does not display it
    if div.find('div', {"class":"vthumb", 'smturl': True}) is not None:
        print div.find('div', {"class":"vthumb", 'smturl': True})['smturl'] #this gives link to micro video

How can I get the YouTube link from the a tag's vrhm attribute?

Upvotes: 1

Views: 145

Answers (1)

nu11p01n73R

Reputation: 26667

You can use json.loads to load a dictionary from a JSON string.

The for loop can be modified as follows:

>>> import json
>>> productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
>>> for div in productDivs:
...     a_dict = json.loads( div.a['vrhm'] )
...     print a_dict['p']
https://www.youtube.com/watch?v=bWbrWI3PBss
https://www.youtube.com/watch?v=bWbrWI3PBss
https://www.youtube.com/watch?v=PbTx2Fjth-0
https://www.youtube.com/watch?v=pB1Kjx-eheY
..
..

What does it do?

  • div.a['vrhm'] extracts the vrhm attribute of the first a tag found inside the div.

  • a_dict = json.loads( div.a['vrhm'] ) parses the JSON string and creates the dictionary a_dict.

  • print a_dict['p'] prints the value stored under the key 'p', which holds the YouTube URL. a_dict is an ordinary Python dictionary, so you can use it as you usually would.
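If BeautifulSoup is not available, the same idea can be sketched with only the standard library. This is a minimal sketch, not the method used above: it swaps BeautifulSoup for html.parser, and the names VrhmCollector, extract_video_urls and SAMPLE_HTML are made up for illustration. The sample markup is also an assumption; only the vrhm attribute, the 'p' key and the example URL come from the output above. It additionally skips anchors whose vrhm is missing or is not valid JSON, which the loop above would raise an error on.

```python
import json
from html.parser import HTMLParser


class VrhmCollector(HTMLParser):
    """Collect the 'p' value from the JSON stored in each <a vrhm=...>."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        raw = dict(attrs).get("vrhm")
        if raw is None:
            return  # anchor without a vrhm attribute
        try:
            data = json.loads(raw)  # JSON string -> Python object
        except ValueError:
            return  # attribute is not valid JSON; skip it
        if isinstance(data, dict) and "p" in data:
            self.urls.append(data["p"])


def extract_video_urls(html):
    """Return the 'p' values found in the vrhm attributes of all a tags."""
    parser = VrhmCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical markup; the real page may be structured differently.
SAMPLE_HTML = """
<div class="dg_u"><a vrhm='{"p": "https://www.youtube.com/watch?v=bWbrWI3PBss"}'>one</a></div>
<div class="dg_u"><a vrhm='not json'>two</a></div>
<div class="dg_u"><a href="#">three</a></div>
"""

if __name__ == "__main__":
    for url in extract_video_urls(SAMPLE_HTML):
        print(url)
```

The try/except around json.loads is a design choice: one malformed attribute is skipped instead of aborting the whole loop.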

Upvotes: 1
