Python webcrawling BeautifulSoup: getting both text and links

Question

The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on now is http://www.boxofficemojo.com/movies/?id=catchingfire.htm. From this page, I am having trouble getting two things. First, I need to get the "Foreign" gross amount (under Total Lifetime Grosses). I'm not sure how to do this because when I inspect the element, it doesn't seem to have a specific tag of its own, and there's a ton of styled tags surrounding it. How can I get this piece of data?

Next, I am trying to get a list of the actors for each movie. I have successfully gotten all the actors that have links attached (by searching for the a href tags), but am not able to get the actors that have no links.

import requests
from bs4 import BeautifulSoup

# listOfTitles, listOfDirectors, listOfActors, details() and getDirectors()
# are defined elsewhere (not shown).


def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        # The attribute value is quoted because it contains '/' and '?'
        for link in soup.select('td > b > font > a[href^="/movies/?"]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)

            listOfDirectors.append(getDirectors(href))
            listOfActors.append(getActors(href))

            title = link.string
            listOfTitles.append(title)
        page += 1


def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempActors = []
    # Only matches actors whose names are wrapped in a link
    for actor in soup.select('td > font > a[href^="/people/chart/?view=Actor"]'):
        tempActors.append(str(actor.string))
    return tempActors

In the getActors function, I put each movie's actors into a temporary list; the spider() function then appends that list to a complete list covering every movie. The current way I am getting the actors is:

for actor in soup.select('td > font > a[href^="/people/chart/?view=Actor"]'):
    tempActors.append(str(actor.string))

This obviously doesn't work for the actors with no links. I have tried

for actor in soup.findAll('br', {'class', 'mp_box_content'}):
     tempActors.append(str(actor.string))

but this doesn't work; nothing gets added to the list. How can I get all the actors, regardless of whether they have links or not?

alecxe · Accepted Answer

To get the "Foreign" gross, find the text node "Foreign:", go up to its td parent, and take that parent's next td sibling:

In [4]: soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
Out[4]: u'$440,244,916'
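
The same lookup can be wrapped in a small helper that returns None instead of raising when the label is missing. This is only a sketch: get_foreign_gross and SAMPLE_HTML are names made up for illustration, the markup is a simplified stand-in for the real page (which may be structured differently), and string= is used because newer Beautiful Soup releases prefer it over text=.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup. The real page's structure may differ,
# and the "Domestic:" figure here is a dummy placeholder.
SAMPLE_HTML = """
<table>
  <tr><td>Domestic:</td><td>$100</td></tr>
  <tr><td><a href="/intl/">Foreign:</a></td><td>$440,244,916</td></tr>
</table>
"""


def get_foreign_gross(soup):
    """Return the text of the cell that follows the "Foreign:" label, or None."""
    # Exact match on the text node; string= is the newer name for text=.
    label = soup.find(string="Foreign:")
    if label is None:
        return None
    # Climb to the enclosing cell, then step to the next cell in the row.
    label_cell = label.find_parent("td")
    if label_cell is None:
        return None
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
foreign_gross = get_foreign_gross(soup)
print(foreign_gross)
```

Returning None lets the crawler skip movies whose pages lack a foreign gross rather than stopping on an AttributeError.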

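As for actors, a similar technique can be applied: locate the "Actors:" text node, go up to its tr parent, and collect all text nodes inside it (text=True). The [1:] slice drops the first text node, which is the "Actors:" label itself: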

In [5]: soup.find(text="Actors:").find_parent("tr").find_all(text=True)[1:]
Out[5]: 
[u'Jennifer Lawrence',
 u'Josh Hutcherson',
 u'Liam Hemsworth',
 u'Elizabeth Banks',
 u'Stanley Tucci',
 u'Woody Harrelson',
 u'Philip Seymour Hoffman',
 u'Jeffrey Wright',
 u'Jena Malone',
 u'Amanda Plummer',
 u'Sam Claflin',
 u'Donald Sutherland',
 u'Lenny Kravitz']

Note that this has proven to work for this particular page. Test it on other movie pages and make sure it produces the desired result.
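
Since other pages may contain whitespace between tags or lack an actors row entirely, a slightly more defensive variant might look like the sketch below. It is untested against the live site: get_actors and SAMPLE_HTML are hypothetical names, the markup (including which names are linked and the href values) is invented for illustration, and stripped_strings is used in place of find_all(text=True) so whitespace-only text nodes are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing linked and unlinked names. Which actors are
# linked on the real page, and the href values, are assumptions.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Actors:</td>
    <td>
      <a href="/people/chart/?view=Actor&amp;id=example1.htm">Jennifer Lawrence</a><br>
      Jena Malone<br>
      <a href="/people/chart/?view=Actor&amp;id=example2.htm">Sam Claflin</a>
    </td>
  </tr>
</table>
"""


def get_actors(soup):
    """Return every actor name in the "Actors:" row, linked or not."""
    label = soup.find(string="Actors:")
    if label is None:
        return []
    row = label.find_parent("tr")
    if row is None:
        return []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that are only whitespace.
    names = list(row.stripped_strings)
    # Drop the "Actors:" label itself.
    return [name for name in names if name != "Actors:"]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
actors = get_actors(soup)
print(actors)
```

Filtering out the label by value rather than by position avoids relying on it always being the first text node in the row.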
