Reputation: 6020
I am trying to scrape the following information from IMDb:
Desired Output:
$220,000,000 (estimated), $207,438,708 (USA), (4,349 Screens)
I wrote the following code to get the HTML seen below:
from pattern import web
import requests
url_business = url_movie = "http://www.imdb.com/title/tt0848228/business"
business_html = requests.get(url_business)
dom = web.Element(business_html.text)
for business in dom.by_id('tn15content'):
    print business.source
The output (truncated) looks like this:
<div id="tn15content">
<h5>Budget</h5>
$220,000,000 (estimated)<br/>
<br/>
<h5>Opening Weekend</h5>
$207,438,708 (USA) (<a href="/date/05-06/">6 May</a> <a href="/year/2012/">2012</a>) (4,349 Screens)<br/>£15,778,074 (UK) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (521 Screens)<br/>$178,400,000 (Non-USA) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>)<br/>BRL 20,387,104 (Brazil) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (996 Screens)<br/>$51,640 (Cambodia) (<a href="/date/05-17/">17 May</a> <a href="/year/2012/">2012</a>)<br/>INR 110,000,000 (India) (<a href="/date/04-27/">27 April</a> <a href="/year/2012/">2012</a>)<br/>€4,752,836 (Italy) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (678 Screens)<br/>PHP 277,383,923 (Philippines) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (479 Screens)<br/>€468,100 (Portugal) (<a href="/date/04-29/">29 April</a> <a href="/year/2012/">2012</a>) (80 Screens)<br/>
<br/>
<h5>Gross</h5>
Because the text is not within any tag, I cannot do element.by_tag().content. So how do I get the information?
Upvotes: 0
Views: 407
Reputation: 3595
Here's what I have so far. I think it should be easy to take it from here.
from pattern import web
import requests
import sys
url = "http://www.imdb.com/title/tt0848228/business"
r = requests.get(url)
if not r.ok:
    sys.exit(-1)
d = web.Element(r.text)
x = d.getElementById('tn15content')
Split the text of the DOM element x by <h5>.
strs = x.string.split('<h5>')
First two items
print strs[0]
print strs[1]
Here are the rest of the elements; split them by <br />
b = strs[2].split(r'<br />')
Get rid of the a href string.
import re
r = re.compile(r'(<a.*a>)')
for i in b:
    print r.sub('', i)
Output:
Opening Weekend</h5>
$207,438,708 (USA) () (4,349 Screens)
£15,778,074 (UK) () (521 Screens)
$178,400,000 (Non-USA) ()
BRL 20,387,104 (Brazil) () (996 Screens)
$51,640 (Cambodia) ()
INR 110,000,000 (India) ()
€4,752,836 (Italy) () (678 Screens)
PHP 277,383,923 (Philippines) () (479 Screens)
€468,100 (Portugal) () (80 Screens)
I think you can follow this to get the desired output.
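For reference, the same three steps (split on <h5>, split on <br/>, strip the links) can be put together using only the standard library. This is a rough sketch, not a definitive solution: SAMPLE is a shortened, hand-copied piece of the markup from the question rather than a live response, clean_row and parse_sections are names made up for this example, and it assumes the page still has the layout shown above. It is written with print() so it runs on Python 3.

```python
import re

# Shortened, hand-copied sample of the markup shown in the question
# (only two "Opening Weekend" rows kept); not a live IMDb response.
SAMPLE = (
    '<div id="tn15content">\n'
    '<h5>Budget</h5>\n'
    '$220,000,000 (estimated)<br/>\n'
    '<br/>\n'
    '<h5>Opening Weekend</h5>\n'
    '$207,438,708 (USA) (<a href="/date/05-06/">6 May</a> '
    '<a href="/year/2012/">2012</a>) (4,349 Screens)<br/>'
    '$51,640 (Cambodia) (<a href="/date/05-17/">17 May</a> '
    '<a href="/year/2012/">2012</a>)<br/>\n'
    '<br/>\n'
    '<h5>Gross</h5>\n'
    '</div>'
)


def clean_row(row):
    """Strip links, leftover tags and the empty "()" the links leave behind."""
    row = re.sub(r'<a\b[^>]*>.*?</a>', '', row)  # non-greedy: one link at a time
    row = re.sub(r'<[^>]+>', '', row)            # any other tag, e.g. </div>
    row = re.sub(r'\(\s*\)', '', row)            # parentheses emptied by the link removal
    return ' '.join(row.split())                 # collapse runs of whitespace


def parse_sections(html):
    """Return {heading: [cleaned rows]} for every <h5> block in html."""
    # With one capture group, re.split returns
    # [text_before, heading1, body1, heading2, body2, ...].
    parts = re.split(r'<h5>(.*?)</h5>', html)
    sections = {}
    for heading, body in zip(parts[1::2], parts[2::2]):
        rows = [clean_row(r) for r in re.split(r'<br\s*/?>', body)]
        sections[heading] = [r for r in rows if r]  # drop empty rows
    return sections


sections = parse_sections(SAMPLE)
summary = ', '.join([sections['Budget'][0], sections['Opening Weekend'][0]])
print(summary)
```

To try it on the real page, pass r.text from the requests call above to parse_sections instead of SAMPLE. The link removal here is non-greedy, so each <a>...</a> is removed on its own and any text between two links is kept, unlike the (<a.*a>) pattern, which removes everything from the first link to the last one on the line.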
Upvotes: 1