Scraping the web in Python

Question

I'm completely new to web scraping, but I really want to learn it in Python. I have a basic understanding of Python.

I'm having trouble understanding some code that scrapes a webpage, because I can't find good documentation for the modules it uses.

The code scrapes movie data from this webpage.

I get stuck after the comment "selection in pattern follows the rules of CSS".

I would like to understand the logic behind that code, or find good documentation for those modules. Is there any prerequisite topic I need to learn first?

The code is the following:

import requests
from pattern import web
from BeautifulSoup import BeautifulSoup

url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url

url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url  # notice it constructs the full url for you

# selection in pattern follows the rules of CSS

dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):    
    title = movie.by_tag('a')[0].content
    genres = movie.by_tag('span.genre')[0].by_tag('a')
    genres = [g.content for g in genres]
    runtime = movie.by_tag('span.runtime')[0].content
    rating = movie.by_tag('span.value')[0].content
    print title, genres, runtime, rating

haferje · Accepted Answer

Here's the documentation for BeautifulSoup, which is an HTML and XML parser.

The comment

selection in pattern follows the rules of CSS

means that strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for. For example, td.title matches every td element whose class attribute includes title, as in class="title".

The code iterates over the matching HTML elements in the page and extracts the title, genres, runtime, and rating using those CSS selectors.
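
To make the selector idea concrete, here is a minimal sketch that uses only the standard library. It is not how pattern or BeautifulSoup work internally. The sample markup is hand-written for illustration (an assumption, not the real IMDb page), and select_simple is a hypothetical helper that understands only the 'tag.class' form.

```python
import xml.etree.ElementTree as ET

# Hand-written sample markup (an assumption, not the real IMDb page).
SAMPLE = """
<table>
  <tr>
    <td class="number">1.</td>
    <td class="title">
      <a href="/title/tt0000001/">Example Movie</a>
      <span class="genre"><a>Drama</a> | <a>Crime</a></span>
      <span class="runtime">142 mins.</span>
      <span class="value">9.2</span>
    </td>
  </tr>
</table>
"""


def select_simple(root, selector):
    """Return elements under root that match a 'tag.class' selector.

    Hypothetical helper: it handles only the 'tag' and 'tag.class'
    forms, a small subset of what real CSS selectors allow.
    """
    tag, _, cls = selector.partition('.')
    return [el for el in root.iter(tag)
            if not cls or cls in el.get('class', '').split()]


def scrape(markup):
    """Mirror the loop from the question on the sample markup."""
    root = ET.fromstring(markup)
    rows = []
    for movie in select_simple(root, 'td.title'):
        # First link inside the cell is the movie title.
        title = select_simple(movie, 'a')[0].text
        genre_span = select_simple(movie, 'span.genre')[0]
        genres = [a.text for a in select_simple(genre_span, 'a')]
        runtime = select_simple(movie, 'span.runtime')[0].text
        rating = select_simple(movie, 'span.value')[0].text
        rows.append((title, genres, runtime, rating))
    return rows


print(scrape(SAMPLE))
```

Note that 'td.title' skips the td with class="number" and keeps only the cell with class="title", which is the filtering the comment in your code refers to. Real pages are rarely well-formed XML, so in practice you would rely on an HTML parser; Beautiful Soup 4, for instance, accepts CSS selectors through its select() method.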
