Harrison

Reputation: 830

Web scraping without knowledge of page structure

I'm trying to teach myself a concept by writing a script. Basically, I'm trying to write a Python script that, given a few keywords, will crawl web pages until it finds the data I need. For example, say I want to find a list of venomous snakes that live in the US. I might run my script with the keywords list,venomous,snakes,US, and I want to be able to trust with at least 80% certainty that it will return a list of venomous snakes in the US.

I already know how to implement the web spider part; I just want to learn how I can determine a web page's relevancy without knowing a single thing about the page's structure. I have researched web scraping techniques, but they all seem to assume knowledge of the page's HTML tag structure. Is there an algorithm out there that would allow me to pull data from a page and determine its relevancy?

Any pointers would be greatly appreciated. I am using Python with urllib and BeautifulSoup.

Upvotes: 8

Views: 4513

Answers (2)

Farshid Ashouri

Reputation: 17711

Using a crawler framework like Scrapy (here just for handling concurrent downloads), you can write a simple spider like the one below; Wikipedia is a good starting point. The following script is a complete example using Scrapy, NLTK and Whoosh: it never stops crawling, and it indexes each page it visits for later search with Whoosh. It's a small Google:

__author__ = "Farsheed Ashouri"
import os
import sys
import re
## Spider libraries
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from main.items import MainItem
from scrapy.http import Request
from urlparse import urljoin
## indexer libraries
from whoosh.index import create_in, open_dir
from whoosh.fields import *
## html to text conversion module
import nltk

def open_writer():
    if not os.path.isdir("indexdir"):
        os.mkdir("indexdir")
        schema = Schema(title=TEXT(stored=True), content=TEXT(stored=True))
        ix = create_in("indexdir", schema)
    else:
        ix = open_dir("indexdir")
    return ix.writer()

class Main(BaseSpider):
    name        = "main"
    allowed_domains = ["en.wikipedia.org"]
    start_urls  = ["http://en.wikipedia.org/wiki/Snakes"]
    
    def parse(self, response):
        writer = open_writer()  ## for indexing
        sel = Selector(response)
        email_validation = re.compile(r'^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,4})$')  ## currently unused
        ## Links found on this page that we schedule for crawling
        ## (Scrapy's built-in duplicate filter skips URLs that were already requested).
        crawledLinks = set()
        titles   = sel.xpath('//div[@id="content"]//h1[@id="firstHeading"]//span/text()').extract()
        contents = sel.xpath('//body/div[@id="content"]').extract()
        if not titles or not contents:
            return  ## nothing worth indexing on this page
        title   = titles[0]
        content = contents[0]
        links   = sel.xpath('//a/@href').extract()

        
        for link in links:
            ## If it is a proper article link and not seen yet on this page, yield it to the Spider.
            url = urljoin(response.url, link)
            ## The URL must not contain ":" after /wiki/, e.g. /wiki/Talk:Company or /wiki/File:...
            if url not in crawledLinks and re.match(r'http://en.wikipedia.org/wiki/[^:]+$', url):
                crawledLinks.add(url)
                yield Request(url, self.parse)
        item = MainItem()
        item["title"] = title
        item["links"] = list(crawledLinks)
        print '*' * 80
        print 'crawled: %s | it has %s links.' % (title, len(links))
        print '*' * 80
        ## Save only the text of the page body in the index.
        ## NOTE: nltk.clean_html() was dropped in NLTK 3.0; with a newer NLTK,
        ## strip the markup with BeautifulSoup(content).get_text() instead.
        writer.add_document(title=title, content=nltk.clean_html(content))
        writer.commit()
        yield item
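
Once the spider has filled the "indexdir" directory (for example after letting "scrapy crawl main" run for a while from the Scrapy project), the index can be queried with Whoosh. A minimal sketch of such a search; the query string is just an example:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("indexdir")  ## the directory built by open_writer() above
with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse(u"venomous snakes US")
    for hit in searcher.search(query, limit=10):
        print(hit["title"])  ## titles of the best-matching pages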

Upvotes: 5

Dan O

Reputation: 6090

You're basically asking "how do I write a search engine." This is... not trivial.

The right way to do this is to just use Google's (or Bing's, or Yahoo!'s, or...) search API and show the top n results. But if you're just working on a personal project to teach yourself some concepts (not sure exactly which ones those would be, though), then here are a few suggestions (a rough sketch follows the list):

  • search the text content of the appropriate tags (<p>, <div>, and so forth) for relevant keywords (duh)
  • use the relevant keywords to check for the presence of tags that might contain what you're looking for. For example, if you're looking for a list of things, then a page containing <ul> or <ol> or even <table> might be a good candidate
  • build a synonym dictionary and search each page for synonyms of your keywords too. Limiting yourself to "US" might mean an artificially low ranking for a page containing just "America"
  • keep a list of related words which are not in your set of keywords and give a higher ranking to pages which contain more of them. These pages are (arguably) more likely to contain the answer you're looking for
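
A rough sketch of these ideas (keyword counting over the page text, a synonym table, and a bonus for list-like tags), in the spirit of the question's urllib + BeautifulSoup setup. The synonym table and the scoring bonus are made-up placeholders, not a tested ranking scheme:

import urllib  ## Python 2, as in the question; use urllib.request on Python 3
from bs4 import BeautifulSoup

## Hypothetical synonym table; a real one would be much larger.
SYNONYMS = {"US": ["America", "United States"], "venomous": ["poisonous"]}

def relevance(url, keywords):
    """Crude relevance score: keyword/synonym hits plus a bonus for list-like tags."""
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text().lower()

    score = 0
    for kw in keywords:
        terms = [kw] + SYNONYMS.get(kw, [])
        score += sum(text.count(t.lower()) for t in terms)

    ## Pages with lists or tables are better candidates for "a list of ...".
    if soup.find(["ul", "ol", "table"]):
        score += 10  ## arbitrary bonus, tune as needed

    return score

Crawl with your spider as before, score each page with something like relevance(url, ["list", "venomous", "snakes", "US"]), and keep only the pages above a threshold you pick.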

good luck (you'll need it)!

Upvotes: 3
