SAKAMOTO

Reputation: 29

Scraping with Python?

I'd like to grab all the index words and their definitions from here. Is it possible to scrape web content with Python?

Exploring with Firebug shows that the following URL returns the content I want, including both the index word and its definition, for 'a':

http://pali.hum.ku.dk/cgi-bin/cpd/pali?acti=xart&arid=14179&sphra=undefined

What modules should I use? Is there a tutorial available?

I do not know how many words are indexed in the dictionary. I'm an absolute beginner at programming.

Upvotes: 2

Views: 1142

Answers (2)

Boaz Yaniv

Reputation: 6424

You can get data from the web using the built-in urllib or urllib2, but the parsing itself is the most important part. May I suggest the wonderful BeautifulSoup? It can deal with just about anything. http://www.crummy.com/software/BeautifulSoup/

The documentation is built like a tutorial. Sorta: http://www.crummy.com/software/BeautifulSoup/documentation.html
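To make that concrete, here is a minimal sketch that fetches the article URL from your question and parses it with BeautifulSoup 3 (BeautifulStoneSoup is BS3's XML-oriented parser). Treat it as a way to inspect the XML structure, not finished parsing code:

import urllib2
from BeautifulSoup import BeautifulStoneSoup  # BS3's XML-oriented parser

page = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?'
                       'acti=xart&arid=14179&sphra=undefined')
soup = BeautifulStoneSoup(page.read())
print soup.prettify()  # look at the structure before writing real parsing code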

In your case, you probably need to use wildcards to see all entries in the dictionary. You can do something like this:

import urllib2

def getArticles(query, start_index, count):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xsea&tsearch=%s&rfield=entr&recf=%d&recc=%d' %
                          (query, start_index, count))

    # TODO:
    # Parse the XML here (using BeautifulSoup or an XML parser like Python's
    # own xml.etree). We should at least have the name and ID for each article:
    # article = (article_name, article_id)

    return articles  # a list of (article_name, article_id) tuples parsed from the XML

def getArticleContent(article_id):
    xml = urllib2.urlopen('http://pali.hum.ku.dk/cgi-bin/cpd/pali?' +
                          'acti=xart&arid=%d&sphra=undefined' % article_id)

    # TODO: parse the XML
    return parsed_article
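To fill in the first TODO, a sketch using Python's built-in xml.etree could look like the following. The tag and attribute names (entry, arid) are assumptions on my part; check the actual XML the server returns and adjust accordingly:

import xml.etree.ElementTree as ET

def parseArticleList(xml_text):
    # Hypothetical structure: <result><entry arid="14179">aka</entry>...</result>
    root = ET.fromstring(xml_text)
    return [(entry.text, int(entry.get('arid')))
            for entry in root.findall('entry')]

getArticles would then call something like parseArticleList(xml.read()) and return the result.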

Now you can loop over things. For instance, to get all articles starting with 'ana', use the wildcard 'ana*' and loop until you get no results:

query = 'ana*'
article_dict = {}
i = 0
while True:
    new_articles = getArticles(query, i, 100)
    if len(new_articles) == 0:
        break

    i += 100
    for article_name, article_id in new_articles:
        article_dict[article_name] = getArticleContent(article_id)

Once done, you'll have a dictionary of the content of all articles, referenced by names. I omitted the parsing itself, but it's quite simple in this case, since everything is XML. You might not even need to use BeautifulSoup (even though it's still handy and easy to use for XML).

A word of warning, though: you should check the site's usage policy (and maybe robots.txt) before trying to heavily scrape articles. If you're just getting a few articles for yourself, they may not care (though the dictionary's copyright owner may, if it's not public domain), but if you're going to scrape the entire dictionary, that's going to be some heavy usage.
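If you want to automate that robots.txt check, the standard library already ships a parser for it (the robotparser module in Python 2, matching the urllib2 code above); a quick sketch:

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://pali.hum.ku.dk/robots.txt')
rp.read()
print rp.can_fetch('*', 'http://pali.hum.ku.dk/cgi-bin/cpd/pali')  # True if allowed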

Upvotes: 3

Adam Matan

Reputation: 136381

You should use urllib2 for getting the URL contents and BeautifulSoup for parsing the HTML/XML.

Example - retrieving all questions from the StackOverflow.com main page:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://stackoverflow.com")
soup = BeautifulSoup(page)

for incident in soup('h3'):
    print [i.decode('utf8') for i in incident.contents]
    print

This code sample was adapted from the BeautifulSoup documentation.

Upvotes: 6
