alphamonkey

Reputation: 249

Python BeautifulSoup webcrawling: Appending piece of data to list

The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on now is http://www.boxofficemojo.com/movies/?id=catchingfire.htm.

I need to get the "Foreign" gross amount (under Total Lifetime Grosses). It works when I call my function with a single link that I type in, but for some reason I can't get it to work inside the loop that goes through all the movies.

This is my function to get this amount for each movie:

def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip = True)

This is the function that loops through each link:

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)
            listOfDirectors.append(getDirectors(href))
            str(listOfDirectors).replace('[','').replace(']','')
            #getActors(href)
            title = link.string
            listOfTitles.append(title)
        page += 1

I have a list called listOfForeign = [] that I want to append every movie's foreign gross amount to. The problem is, if I call getForeign(item_url) with a single full link that I type in, such as:

listOfForeign.append(getForeign("http://www.boxofficemojo.com/movies/?id=catchingfire.htm"))

and then later

print listOfForeign

it prints out the one correct amount.

But when I run the spider(max_pages) function and add:

listOfForeign.append(getForeign(href)) 

inside the for loop and later try to print listOfForeign, I get an error:

AttributeError: 'NoneType' object has no attribute 'find_parent'

Why am I not able to successfully add this amount for each movie inside the spider function? In spider(max_pages) I get each movie's link in the variable href, so I'm essentially doing the same thing as passing each individual link separately.
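
A rough, untested way I could narrow down which link triggers this would be to guard the lookup before chaining the calls and print any URL that has no "Foreign:" entry:

def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    label = soup.find(text="Foreign:")
    if label is None:
        # no "Foreign:" row on this movie's page -- print the URL so the
        # offending movie can be spotted, and return a placeholder amount
        print "No foreign gross found for", item_url
        return "$0"
    return label.find_parent("td").find_next_sibling("td").get_text(strip=True)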

Complete code:

import requests
from bs4 import BeautifulSoup
from urllib import urlopen
import xlwt
import csv
from tempfile import TemporaryFile

listOfTitles = []
listOfGenre = []
listOfRuntime = []
listOfRatings = []
listOfBudget = []
listOfDirectors = []
listOfActors = []
listOfForeign = []
resultFile = open("movies.csv",'wb')
wr = csv.writer(resultFile, dialect='excel')

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)
            listOfForeign.append(getForeign(href))
            listOfDirectors.append(getDirectors(href))
            str(listOfDirectors).replace('[','').replace(']','')
            #getActors(href)
            title = link.string
            listOfTitles.append(title)
        page += 1


def getDirectors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempDirector = []
    for director in soup.select('td > font > a[href^=/people/chart/?view=Director]'):
        tempDirector.append(str(director.string))
    return tempDirector

def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempActors = []
    print soup.find(text="Actors:").find_parent("tr").text[7:]



def details(href):
    response = requests.get(href)
    soup = BeautifulSoup(response.content)
    genre = soup.find(text="Genre: ").next_sibling.text
    rating = soup.find(text='MPAA Rating: ').next_sibling.text
    runtime = soup.find(text='Runtime: ').next_sibling.text
    budget = soup.find(text='Production Budget: ').next_sibling.text

    listOfGenre.append(genre)
    listOfRuntime.append(runtime)
    listOfRatings.append(rating)
    listOfBudget.append(budget)


def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    try:
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip = True)
    except AttributeError:
        return "$0"

spider(1)

print listOfForeign
wr.writerow(listOfTitles)
wr.writerow(listOfGenre)
wr.writerow(listOfRuntime)
wr.writerow(listOfRatings)
wr.writerow(listOfBudget)
for item in listOfDirectors:
    wr.writerow(item)

Upvotes: 2

Views: 1264

Answers (1)

alecxe

Reputation: 474071

The code fails once it hits a movie page that has no foreign gross listed, e.g. the page for the movie 42. You should handle cases like this, for example by catching the exception and falling back to $0.

You may also be running into differences between the HTML parsers BeautifulSoup picks by default: specify the lxml or html5lib parser explicitly (you would need to have lxml or html5lib installed).

Also, why not use requests to fetch and parse the movie page too:

def getForeign(item_url):
    response = requests.get(item_url)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    try:
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip = True)
    except AttributeError:
        return "$0"

As a side note, the code is overall getting pretty complex and slow: because of the blocking nature of the script, the requests are sent one by one, sequentially. It might be a good idea to switch to the Scrapy web-scraping framework, which, aside from making the crawl a lot faster, would help organize the code into logical groups: you would have a spider with the scraping logic inside, an item class defining your extraction data model, a pipeline for writing the extracted data to a database if needed, and much more.
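
A minimal, untested sketch of what such a spider might look like (the selectors, XPath expressions and field names below are assumptions based on your current code, not a drop-in implementation):

import scrapy


class BoxOfficeMojoSpider(scrapy.Spider):
    name = "boxofficemojo"
    start_urls = [
        "http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm",
    ]

    def parse(self, response):
        # follow every movie link on the yearly chart page
        for href in response.css('td > b > font > a[href^="/movies/?"]::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_movie)

    def parse_movie(self, response):
        # take the cell right after the "Foreign:" label, falling back to $0
        foreign_parts = response.xpath(
            '//text()[. = "Foreign:"]/ancestor::td[1]/following-sibling::td[1]//text()'
        ).extract()
        yield {
            "title": response.css("title::text").extract_first(),
            "foreign": "".join(foreign_parts).strip() or "$0",
        }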

Upvotes: 2
