Reputation: 249
The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on now is http://www.boxofficemojo.com/movies/?id=catchingfire.htm.
I need to get the "Foreign gross" amount (under Total Lifetime Grosses), but for some reason I can't get it when I loop through all the movies, even though it works with a single link that I type in.
This is my function to get that amount for each movie:
def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
This is the function that loops through each link:
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)
            listOfDirectors.append(getDirectors(href))
            str(listOfDirectors).replace('[','').replace(']','')
            #getActors(href)
            title = link.string
            listOfTitles.append(title)
        page += 1
I have a list called listOfForeign = [] that I want to append every movie's foreign gross amount to. The problem is, if I call getForeign(item_url) with a single full link that I type in, such as:

listOfForeign.append(getForeign("http://www.boxofficemojo.com/movies/?id=catchingfire.htm"))
and then later
print listOfForeign
it prints out the one correct amount.
But when I run the function spider(max_pages), and add:
listOfForeign.append(getForeign(href))
inside the for loop, and later try to print listOfForeign, I get the error:
AttributeError: 'NoneType' object has no attribute 'find_parent'
Why am I not able to successfully add this amount for each movie inside the spider function? In spider(max_pages) I get each movie's link in the variable href, so I am essentially doing the same thing as appending each individual link separately.
Complete code:
import requests
from bs4 import BeautifulSoup
from urllib import urlopen
import xlwt
import csv
from tempfile import TemporaryFile
listOfTitles = []
listOfGenre = []
listOfRuntime = []
listOfRatings = []
listOfBudget = []
listOfDirectors = []
listOfActors = []
listOfForeign = []
resultFile = open("movies.csv",'wb')
wr = csv.writer(resultFile, dialect='excel')
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)
            listOfForeign.append(getForeign(href))
            listOfDirectors.append(getDirectors(href))
            str(listOfDirectors).replace('[','').replace(']','')
            #getActors(href)
            title = link.string
            listOfTitles.append(title)
        page += 1

def getDirectors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempDirector = []
    for director in soup.select('td > font > a[href^=/people/chart/?view=Director]'):
        tempDirector.append(str(director.string))
    return tempDirector

def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempActors = []
    print soup.find(text="Actors:").find_parent("tr").text[7:]

def details(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content)
    genre = soup.find(text="Genre: ").next_sibling.text
    rating = soup.find(text='MPAA Rating: ').next_sibling.text
    runtime = soup.find(text='Runtime: ').next_sibling.text
    budget = soup.find(text='Production Budget: ').next_sibling.text
    listOfGenre.append(genre)
    listOfRuntime.append(runtime)
    listOfRatings.append(rating)
    listOfBudget.append(budget)

def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    try:
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
    except AttributeError:
        return "$0"

spider(1)

print listOfForeign

wr.writerow(listOfTitles)
wr.writerow(listOfGenre)
wr.writerow(listOfRuntime)
wr.writerow(listOfRatings)
wr.writerow(listOfBudget)
for item in listOfDirectors:
    wr.writerow(item)
Upvotes: 2
Views: 1264
Reputation: 474071
The code fails once it hits a movie page without a foreign income, such as the page for 42. You should handle cases like this, for example by catching the exception and defaulting the amount to $0.
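To see why that raises AttributeError, here is a minimal sketch (the HTML snippet is made up for illustration): when the "Foreign:" label is absent, soup.find(text="Foreign:") returns None, and the chained .find_parent("td") call is then made on None.

from bs4 import BeautifulSoup

# Hypothetical movie page with no "Foreign:" cell (a domestic-only release)
html = "<table><tr><td>Domestic:</td><td>$95,000,000</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

print soup.find(text="Foreign:")  # prints None -- the label is not on the page
# soup.find(text="Foreign:").find_parent("td") therefore raises:
# AttributeError: 'NoneType' object has no attribute 'find_parent'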
You are also experiencing the differences between parsers: specify the lxml or html5lib parser explicitly (you would need to have lxml or html5lib installed).
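For instance, either of these lines pins the parser, depending on which one you have installed:

soup = BeautifulSoup(plain_text, "lxml")      # needs: pip install lxml
soup = BeautifulSoup(plain_text, "html5lib")  # needs: pip install html5lib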
Also, why not use requests to fetch the movie page too:
def getForeign(item_url):
    response = requests.get(item_url)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    try:
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
    except AttributeError:
        return "$0"
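It is called the same way as before; for example, with the single link from the question:

listOfForeign.append(getForeign("http://www.boxofficemojo.com/movies/?id=catchingfire.htm"))
print listOfForeign  # a one-element list with the foreign gross string from the page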
As a side note, the code you have is overall getting pretty complex and slow: due to the blocking nature of the script, the requests are sent one by one, sequentially. It might be a good idea to switch to the Scrapy web-scraping framework, which, aside from making the code a lot faster, would help to organize it into logical groups: you would have a spider with the scraping logic inside, an Item class defining your extraction data model, a pipeline for writing the extracted data to a database if needed, and much more.
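To give a rough idea, here is a minimal sketch of what the same crawl could look like in Scrapy. The class name, the field names, and the XPath for the "Foreign:" cell are illustrative assumptions (mirroring the BeautifulSoup lookup above), not code tested against the live site:

import scrapy

class MojoSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm",
    ]

    def parse(self, response):
        # Follow every movie link on the yearly chart page
        for href in response.css('td > b > font > a[href^="/movies/?"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_movie)

    def parse_movie(self, response):
        # Assumes the amount sits in the cell right after the "Foreign:" label
        foreign = response.xpath(
            '//td[normalize-space(.)="Foreign:"]/following-sibling::td[1]//text()'
        ).extract_first(default="$0")
        yield {
            "title": response.css("title::text").extract_first(),
            "foreign": foreign.strip(),
        }

You could then run it with something like scrapy runspider mojo_spider.py -o movies.csv and let Scrapy handle the concurrent downloads and the CSV export.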
Upvotes: 2