Basic HTML with Pattern.web

Question

I am trying to scrape the following information from IMDb:

Budget
Weekend gross (in US)
Screens (associated with weekend gross, US only)

Desired Output:

$220,000,000 (estimated), $207,438,708 (USA), (4,349 Screens)

I wrote the following code to get the HTML seen below:

from pattern import web
import requests

url_business = url_movie = "http://www.imdb.com/title/tt0848228/business"
business_html = requests.get(url_business)
dom = web.Element(business_html.text)

for business in dom.by_id('tn15content'):
    print(business.source)

The output (truncated) looks like this:

Budget
$220,000,000 (estimated)

Opening Weekend
$207,438,708 (USA) (6 May 2012) (4,349 Screens)
£15,778,074 (UK) (29 April 2012) (521 Screens)
$178,400,000 (Non-USA) (29 April 2012)
BRL 20,387,104 (Brazil) (29 April 2012) (996 Screens)
$51,640 (Cambodia) (17 May 2012)
INR 110,000,000 (India) (27 April 2012)
€4,752,836 (Italy) (29 April 2012) (678 Screens)
PHP 277,383,923 (Philippines) (29 April 2012) (479 Screens)
€468,100 (Portugal) (29 April 2012) (80 Screens)

Gross

Because the values are not inside any tag of their own, I cannot use element.by_tag().content. How do I get this information?

gabhijit · Accepted Answer

Here is what I have so far; it should be easy to take it from here.

from pattern import web
import requests
import sys

url = "http://www.imdb.com/title/tt0848228/business"

r = requests.get(url)
if not r.ok:
    sys.exit(-1)

d = web.Element(r.text)

x = d.getElementById('tn15content')

Split the inner HTML of the DOM element x on the <h5> heading tags.

strs = x.string.split('<h5>')

Print the first two items.

print(strs[0])
print(strs[1])

Here are the rest of the elements; split them on <br/>.

b = strs[2].split('<br/>')

Get rid of the <a href> links.

import re
r = re.compile(r'(<a.*</a>)')
for i in b:
    print(r.sub('', i))

Output:

Opening Weekend
$207,438,708 (USA) () (4,349 Screens)
£15,778,074 (UK) () (521 Screens)
$178,400,000 (Non-USA) ()
BRL 20,387,104 (Brazil) () (996 Screens)
$51,640 (Cambodia) ()
INR 110,000,000 (India) ()
€4,752,836 (Italy) () (678 Screens)
PHP 277,383,923 (Philippines) () (479 Screens)
€468,100 (Portugal) () (80 Screens)

You can follow the same approach to get the desired output.
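
For reference, the remaining steps could look roughly like the sketch below. It is not part of the original answer and uses only the standard re module, so it can be tried without Pattern or a network connection. SAMPLE is a hypothetical fragment modelled on the output in the question (the real page markup may differ), and sections() and summary() are illustrative helper names. Unlike the code above, it strips the tags but keeps their text, so the dates survive.

```python
import re

# Hypothetical fragment modelled on the output shown in the question;
# the real IMDb markup may differ.
SAMPLE = (
    '<h5>Budget</h5>\n'
    '$220,000,000 (estimated)<br/>\n'
    '<h5>Opening Weekend</h5>\n'
    '$207,438,708 (USA) (<a href="#">6 May 2012</a>) (4,349 Screens)<br/>\n'
    '$178,400,000 (Non-USA) (<a href="#">29 April 2012</a>)<br/>\n'
    '<h5>Gross</h5>\n'
)


def sections(html):
    """Map each <h5> heading to a list of its tag-free text lines."""
    result = {}
    # Every chunk after the first looks like 'Heading</h5>body'.
    for chunk in html.split('<h5>')[1:]:
        heading, _, body = chunk.partition('</h5>')
        lines = []
        for piece in body.split('<br/>'):
            # Drop any remaining tags but keep the text inside them.
            text = re.sub(r'<[^>]+>', '', piece).strip()
            if text:
                lines.append(text)
        result[heading.strip()] = lines
    return result


def summary(html):
    """Return 'budget, US weekend gross, (screens)' as one string."""
    data = sections(html)
    budget = data['Budget'][0]
    # '(USA)' does not match '(Non-USA)', so only the US line is picked.
    us_line = next(l for l in data['Opening Weekend'] if '(USA)' in l)
    gross = us_line.split(' (USA)')[0] + ' (USA)'
    screens = re.search(r'\([\d,]+ Screens\)', us_line)
    parts = [budget, gross]
    if screens:
        parts.append(screens.group(0))
    return ', '.join(parts)


print(summary(SAMPLE))
```

With the code above, summary(x.string) would be the corresponding call, assuming the page really uses <h5> and <br/> as in the sample.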
