Rohit

Reputation: 6020

Basic HTML with Pattern.web

I am trying to scrape the following information from IMDb:

Desired Output:

$220,000,000 (estimated), $207,438,708 (USA), (4,349 Screens)

I wrote the following code to get the HTML seen below:

from pattern import web
import requests

url_business = url_movie = "http://www.imdb.com/title/tt0848228/business"
business_html = requests.get(url_business)
dom = web.Element(business_html.text)

for business in dom.by_id('tn15content'):
    print business.source 

The output (truncated) looks like this:

<div id="tn15content">


<h5>Budget</h5>
$220,000,000 (estimated)<br/>
<br/>

<h5>Opening Weekend</h5>
$207,438,708 (USA) (<a href="/date/05-06/">6 May</a> <a href="/year/2012/">2012</a>) (4,349 Screens)<br/>&#163;15,778,074 (UK) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (521 Screens)<br/>$178,400,000 (Non-USA) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>)<br/>BRL 20,387,104 (Brazil) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (996 Screens)<br/>$51,640 (Cambodia) (<a href="/date/05-17/">17 May</a> <a href="/year/2012/">2012</a>)<br/>INR 110,000,000 (India) (<a href="/date/04-27/">27 April</a> <a href="/year/2012/">2012</a>)<br/>&#8364;4,752,836 (Italy) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (678 Screens)<br/>PHP 277,383,923 (Philippines) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (479 Screens)<br/>&#8364;468,100 (Portugal) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (80 Screens)<br/>
<br/>

<h5>Gross</h5>

Because the text is not within any tag, I cannot do element.by_tag().content. So how do I get the information?

Upvotes: 0

Views: 407

Answers (1)

gabhijit

Reputation: 3595

Here's what I have so far. I think it should be easy to take it from here.

from pattern import web
import requests
import sys

url = "http://www.imdb.com/title/tt0848228/business"

r = requests.get(url)
if not r.ok:
    sys.exit(-1)

d = web.Element(r.text)

x = d.getElementById('tn15content')

Split the text of the DOM element x by <h5>:

strs = x.string.split('<h5>')

Print the first two items:

print strs[0]
print strs[1]

Here are the rest of the elements; split them by <br />:

b = strs[2].split(r'<br />')

Get rid of the <a href> links:

import re
# Use a separate name so the response object r is not overwritten.
link_re = re.compile(r'(<a.*a>)')
for i in b:
    print link_re.sub('', i)
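One caveat: the greedy `.*` in that pattern matches from the first `<a` to the last `a>` on the line, which only works here because the two links sit next to each other, and it leaves an empty `()` behind. A possible alternative, as a sketch only, is a non-greedy pattern that removes the whole parenthesised date group. The sample line is copied from the page source in the question, and the name `date_group` is made up for this example:

```python
import re

# Sample line copied from the page source shown in the question.
line = ('$207,438,708 (USA) (<a href="/date/05-06/">6 May</a> '
        '<a href="/year/2012/">2012</a>) (4,349 Screens)')

# A parenthesised group that contains nothing but <a>...</a> links and spaces.
# .*? is non-greedy, so each link is matched on its own.
date_group = re.compile(r'\s*\((?:\s*<a\b[^>]*>.*?</a>\s*)+\)')

cleaned = date_group.sub('', line)
print(cleaned)
# $207,438,708 (USA) (4,349 Screens)
```

Groups such as `(USA)` are left alone because they do not contain a link.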

Output:

Opening Weekend</h5>
$207,438,708 (USA) () (4,349 Screens)
&#163;15,778,074 (UK) () (521 Screens)
$178,400,000 (Non-USA) ()
BRL 20,387,104 (Brazil) () (996 Screens)
$51,640 (Cambodia) ()
INR 110,000,000 (India) ()
&#8364;4,752,836 (Italy) () (678 Screens)
PHP 277,383,923 (Philippines) () (479 Screens)
&#8364;468,100 (Portugal) () (80 Screens)
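The output still contains numeric character references such as `&#163;` and `&#8364;`. If you want the actual currency symbols, one option is to decode them. This is only a sketch and assumes Python 3.4 or later, where `html.unescape` is part of the standard library; the sample line is taken from the output above:

```python
import html

# One line from the output, with the pound sign still encoded as &#163;
line = '&#163;15,778,074 (UK) (521 Screens)'

decoded = html.unescape(line)
print(decoded)
# £15,778,074 (UK) (521 Screens)
```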

I think you can follow this to get the desired output.
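For reference, here is a rough end-to-end sketch that builds the string from the question using only the standard `re` module. It runs on a trimmed copy of the HTML posted in the question rather than the live page (which may have changed), and the helper `section_lines` is a name invented for this example, not part of Pattern:

```python
import re

# Trimmed sample of the page source posted in the question.
html = '''<div id="tn15content">
<h5>Budget</h5>
$220,000,000 (estimated)<br/>
<br/>

<h5>Opening Weekend</h5>
$207,438,708 (USA) (<a href="/date/05-06/">6 May</a> <a href="/year/2012/">2012</a>) (4,349 Screens)<br/>&#163;15,778,074 (UK) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (521 Screens)<br/>
<br/>
'''

def section_lines(html, heading):
    """Return the text lines that follow <h5>heading</h5>, date links removed."""
    pattern = r'<h5>%s</h5>(.*?)(?=<h5>|\Z)' % re.escape(heading)
    match = re.search(pattern, html, re.S)
    if match is None:
        return []
    # Drop parenthesised groups that contain only <a>...</a> links.
    body = re.sub(r'\s*\((?:\s*<a\b[^>]*>.*?</a>\s*)+\)', '', match.group(1))
    # Split on <br>, <br/> or <br /> and discard empty pieces.
    return [part.strip() for part in re.split(r'<br\s*/?>', body) if part.strip()]

budget = section_lines(html, 'Budget')[0]
opening_usa = section_lines(html, 'Opening Weekend')[0]

# Separate the amount from the screen count so they can be joined with commas.
amount, screens = re.match(r'(.+?\))\s*(\(.+\))$', opening_usa).groups()
result = ', '.join([budget, amount, screens])
print(result)
# $220,000,000 (estimated), $207,438,708 (USA), (4,349 Screens)
```

Taking the first entry of each section assumes the USA figure is listed first, as it is in the posted HTML; filtering the lines for `(USA)` would be more robust if the order varies.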

Upvotes: 1
