Reputation: 2185
I'm completely new to web scraping, but I really want to learn it in Python. I have a basic understanding of Python.
I'm having trouble understanding some code that scrapes a webpage, because I can't find good documentation for the modules it uses.
The code scrapes some movie data from this webpage.
I get stuck after the comment "selection in pattern follows the rules of CSS".
I would like to understand the logic behind that code, or to find good documentation for those modules. Is there a prerequisite topic I need to learn first?
The code is the following:
import requests
from pattern import web
from BeautifulSoup import BeautifulSoup
url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url
url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url # notice it constructs the full url for you
# selection in pattern follows the rules of CSS
dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating
Upvotes: 1
Views: 1358
Reputation: 1003
Here's the documentation for BeautifulSoup, which is an HTML and XML parser.
The comment "selection in pattern follows the rules of CSS" means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for. For example, td.title matches <td> elements with the attribute class="title".
The code iterates over each matching element in the page and extracts the title, genres, runtime, and rating using CSS selectors.
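To make the idea of a 'tag.class' selector more concrete, here is a minimal sketch that uses only Python 3's standard library. It is not how pattern or BeautifulSoup implement selection. The HTML fragment, the SelectorCollector class, and the select function are all made up for illustration, and the fragment is only loosely shaped like the rows the code in the question expects.

```python
from html.parser import HTMLParser

# Made-up HTML fragment (an assumption, not real IMDb markup).
SAMPLE_HTML = """
<table>
  <tr>
    <td class="number">1.</td>
    <td class="title">
      <a href="/title/example1/">Example Movie</a>
      <span class="runtime">142 mins.</span>
    </td>
  </tr>
  <tr>
    <td class="title">
      <a href="/title/example2/">Another Movie</a>
      <span class="runtime">175 mins.</span>
    </td>
  </tr>
</table>
"""


class SelectorCollector(HTMLParser):
    """Hypothetical helper: collect the text of elements matching 'tag.class'."""

    def __init__(self, selector):
        super().__init__()
        # Split 'td.title' into the tag name 'td' and the class name 'title'.
        self.tag, self.css_class = selector.split(".")
        self.depth = 0     # > 0 while we are inside a matched element
        self.matches = []  # one text string per matched element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside a match: only track nesting of the same tag name.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside a matched element (including its children) is collected.
        if self.depth:
            self.matches[-1] += data


def select(html, selector):
    """Return the whitespace-normalized text of each element matching selector."""
    collector = SelectorCollector(selector)
    collector.feed(html)
    collector.close()
    return [" ".join(text.split()) for text in collector.matches]


print(select(SAMPLE_HTML, "td.title"))
print(select(SAMPLE_HTML, "span.runtime"))
```

The point is only that 'td.title' means "a td tag whose class list contains title", and that 'span.runtime' narrows the search to one piece of data in the same way. Real CSS selectors support much more (descendants, ids, attributes), which is why a dedicated parsing library is the better choice for actual scraping.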
Upvotes: 1