Reputation: 87
I am new to web scraping, Python, and BeautifulSoup, and am having difficulty getting my code to work.
I would like to scrape the URL http://m.imdb.com/feature/bornondate to get the details my code below tries to print (name, image, profession, and best work)
for the ten celebrities on that page. I am not sure what I am doing wrong.
Here is my code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://m.imdb.com/feature/bornondate'
test_url = urllib2.urlopen(url)
readHtml = test_url.read()
test_url.close()
soup = BeautifulSoup(readHtml)
# Counter used to track the number of actors
count = 0
# Fetching the value present within tag results
person = soup.findChildren('section', 'posters list')
# Changing the person into an iterator
iterperson = iter(person[0].findChildren('a'))
# Finding 'a' in iterperson. Every 'a' tag contains information of a person
for a in iterperson:
    imgSource = a.find('img')['src'].split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.findChildren('div', 'label')
    title = person[0].find('span', 'title').contents[0]
    ##profession = person[0].find('div', 'detail').contents[0].split(,)
    ##bestWork = person[0].find('div', 'detail').contents[1].split(,)
    print '*******************************IMDB People Born Today***********************************'
    # Printing the S.No of the person
    print 'S.No. --> ',
    count += 1
    print count
    # Printing the title/name of the person
    print 'Title --> ' + title
    # Printing the Image Source of the person
    print 'Image Source --> ', imgSource
    # Printing the Profession of the person
    ##print 'Profession --> ', profession
    # Printing the Best work of the person
    ##print 'Best Work --> ', bestWork
Currently nothing is getting printed out. Also, if this is too vague, could you explain how to get just the name of each celebrity, for instance?
Here is the first celebrity's html code if that helps:
<section class="posters list">
<h1>March 7</h1>
<a href="/name/nm0186505/" class="poster "><img src="http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1._CR0,0,1369,2019_SX40_SY59.jpg" style="background:url('http://i.media-imdb.com/images/mobile/people-40x59-fade.png')" width="40" height="59"><div class="label"><span class="title">Bryan Cranston</span><div class="detail">Actor, "Ozymandias"</div></div></a>
Upvotes: 4
Views: 3074
Reputation: 362
I am working on the same assignment. The urllib2 library only loads the static content of a URL; it does not run JavaScript. Use Selenium to get the complete HTML, which includes the dynamic content too. If you use urllib2, the generated HTML would be
<span class="loading"></span>
Hope it helps.
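A quick way to confirm this is to check the fetched HTML for the loading placeholder before trying to parse it. The following is only a sketch: it uses Python 3 and the standard-library html.parser so it runs without extra installs, `PosterProbe` and `needs_javascript` are hypothetical names, and the two sample strings are stand-ins based on the markup quoted in this thread rather than a real response from the site.

```python
from html.parser import HTMLParser


class PosterProbe(HTMLParser):
    """Counts poster links and loading placeholders in an HTML document."""

    def __init__(self):
        super().__init__()
        self.posters = 0
        self.loading = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several names, e.g. class="poster "
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "poster" in classes:
            self.posters += 1
        elif tag == "span" and "loading" in classes:
            self.loading += 1


def needs_javascript(html):
    """Return True if the HTML has a loading placeholder but no posters."""
    probe = PosterProbe()
    probe.feed(html)
    return probe.loading > 0 and probe.posters == 0


# Assumed sample of what a plain HTTP fetch returns (placeholder only)
static_html = '<span class="loading"></span>'

# Assumed sample of the markup after the browser has run the page scripts
rendered_html = (
    '<section class="posters list"><h1>March 7</h1>'
    '<a href="/name/nm0186505/" class="poster ">'
    '<span class="title">Bryan Cranston</span></a></section>'
)

print(needs_javascript(static_html))    # True
print(needs_javascript(rendered_html))  # False
```

If the check returns True for the HTML you downloaded, the parsing code is not the problem; the data simply is not in the response yet.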
Upvotes: 0
Reputation: 473893
First of all, screen scraping is explicitly forbidden by the IMDb "Conditions of Use":
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
Try exploring the IMDb JSON API instead of a web-scraping approach.
Your current problem is that the list of people born on a specific date is loaded via a separate call to the IMDb
API, driven by JavaScript logic, so it is not part of the HTML that urllib2 downloads.
The easiest option right now would be to switch to the Selenium
browser automation tool. Here is a working example using a headless PhantomJS
browser:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://m.imdb.com/feature/bornondate")
# waiting for posters to load
wait = WebDriverWait(driver, 10)
posters = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "section.posters")))
# extracting the data poster by poster
for a in posters.find_elements_by_css_selector('a.poster'):
    img = a.find_element_by_tag_name('img').get_attribute('src').split('._V1.')[0] + '._V1_SX214_AL_.jpg'
    person = a.find_element_by_css_selector('div.detail').text
    title = a.find_element_by_css_selector('span.title').text
    print img, person, title
Prints:
http://ia.media-imdb.com/images/M/MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@._V1_SX214_AL_.jpg Actor, "Ozymandias" Bryan Cranston
http://ia.media-imdb.com/images/M/MV5BNjUxNjcxMjE4N15BMl5BanBnXkFtZTgwNDk4NjA2MzE@._V1_SX214_AL_.jpg Actress, "Karla" Laura Prepon
http://ia.media-imdb.com/images/M/MV5BMTQ4MzM1MDAwMV5BMl5BanBnXkFtZTcwNTU4NzQwMw@@._V1_SX214_AL_.jpg Actress, "The Mummy" Rachel Weisz
http://ia.media-imdb.com/images/M/MV5BMjE0Mjg0NzE2Nl5BMl5BanBnXkFtZTcwMDE1MTkxMw@@._V1_SX214_AL_.jpg Actor, "Jarhead" Peter Sarsgaard
http://ia.media-imdb.com/images/M/MV5BMTMyOTYzODQ5MF5BMl5BanBnXkFtZTcwMjE3MDgzMQ@@._V1_SX214_AL_.jpg Actress, "Blades of Glory" Jenna Fischer
http://ia.media-imdb.com/images/M/MV5BMzE2OTAwNzM0Ml5BMl5BanBnXkFtZTcwNzE1MDg0Mw@@._V1_SX214_AL_.jpg Actress, "Tangled" Donna Murphy
http://ia.media-imdb.com/images/M/MV5BMTI0OTMzMzE0N15BMl5BanBnXkFtZTcwMjI1MzYyMQ@@._V1_SX214_AL_.jpg Actor, "How the Grinch Stole Christmas" T.J. Thyne
http://ia.media-imdb.com/images/M/MV5BNzczODkyNzY4OV5BMl5BanBnXkFtZTcwNTU0NjQzMQ@@._V1_SX214_AL_.jpg Actor, "Home Alone" John Heard
http://ia.media-imdb.com/images/M/MV5BMTg4MjU2MzA2OV5BMl5BanBnXkFtZTgwOTIxMjc4MjE@._V1_SX214_AL_.jpg Actress, "Beerfest" Audrey Marie Anderson
http://ia.media-imdb.com/images/M/MV5BMTQyOTc5NzA0M15BMl5BanBnXkFtZTYwODQ2MjYz._V1_SX214_AL_.jpg Producer, "Kick-Ass" Matthew Vaughn
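To get just the name, or to separate the profession from the best-known work, the field handling can be tried out on rendered markup without a browser. This is a rough sketch rather than part of the Selenium example: it is written for Python 3 with the standard-library html.parser, the names `PosterParser`, `full_size_image`, and `split_detail` are made up for illustration, and the fragment is a trimmed copy of the first celebrity's HTML from the question (the closing section tag is added here).

```python
from html.parser import HTMLParser


class PosterParser(HTMLParser):
    """Collects one dict per <a class="poster"> link in rendered HTML."""

    def __init__(self):
        super().__init__()
        self.people = []
        self._field = None  # text field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'poster' in classes:
            self.people.append({'name': '', 'detail': '', 'img': ''})
        elif not self.people:
            return  # ignore everything before the first poster
        elif tag == 'img':
            self.people[-1]['img'] = attrs.get('src') or ''
        elif tag == 'span' and 'title' in classes:
            self._field = 'name'
        elif tag == 'div' and 'detail' in classes:
            self._field = 'detail'

    def handle_endtag(self, tag):
        if tag in ('span', 'div'):
            self._field = None

    def handle_data(self, data):
        if self._field and self.people:
            self.people[-1][self._field] += data


def full_size_image(src):
    """Replace everything from '._V1.' onward with the larger-image suffix."""
    return src.split('._V1.')[0] + '._V1_SX214_AL_.jpg'


def split_detail(detail):
    """Split 'Actor, "Ozymandias"' into ('Actor', 'Ozymandias')."""
    profession, _, best_work = detail.partition(',')
    return profession.strip(), best_work.strip().strip('"')


# Trimmed copy of the first poster from the question (style/size attributes dropped)
fragment = (
    '<section class="posters list"><h1>March 7</h1>'
    '<a href="/name/nm0186505/" class="poster ">'
    '<img src="http://ia.media-imdb.com/images/M/'
    'MV5BMTA2NjEyMTY4MTVeQTJeQWpwZ15BbWU3MDQ5NDAzNDc@'
    '._V1._CR0,0,1369,2019_SX40_SY59.jpg">'
    '<div class="label"><span class="title">Bryan Cranston</span>'
    '<div class="detail">Actor, "Ozymandias"</div></div></a></section>'
)

parser = PosterParser()
parser.feed(fragment)
first = parser.people[0]

print(first['name'])                    # Bryan Cranston
print(split_detail(first['detail']))    # ('Actor', 'Ozymandias')
print(full_size_image(first['img']))
```

With Selenium, the same helpers could be fed the text of `div.detail` and the `src` of each image; `split_detail` assumes the detail text always has the form profession, comma, quoted title, as in the output above.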
Upvotes: 4