Deepankar Bajpeyi

Reputation: 5859

Does web scraping have patterns?

I have not done much web scraping. So far I am using Python with BeautifulSoup4 to scrape the Hacker News front page.

I was wondering whether there are patterns I should keep in mind when scraping. Right now the code looks very ugly and I feel like a hack.

Code:

import requests
from bs4 import BeautifulSoup
# Assumption: BaseCommand is Django's management command base class
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    page = {}        # scraped rows, keyed by a running row counter
    td_count = 2     # used to detect every third row (the subtext row)
    data_count = 0

    def handle(self, *args, **options):
        # Scrape the first three pages
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print(self.page[1])

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            if self.td_count % 3 == 0:
                try:
                    # The subtext row belongs to the previous (title) row
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass

            title = x.find_all('td', 'title')
            if title:
                try:
                    self.page[self.data_count]['url'] = title[1].a
                    print(title[1].a)
                except IndexError:
                    print('Done page %s' % self.page_no)
            self.td_count += 1

Upvotes: 2

Views: 2575

Answers (1)

Reza Shadman

Reputation: 617

I treat scraped data as part of my domain (business) data, which lets me use Domain-Driven Design to structure the problem:

Entities and Value Objects

I use entities and value objects to hold the extracted information in my programming language's own data structures, so the rest of the code works with typed objects instead of raw HTML fragments.
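A minimal sketch of what this could look like in Python for the Hacker News case. The names `Url` and `Story` and their fields are hypothetical, chosen only for illustration; the validation rule is an assumption and would reject relative links.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Url:
    """Value object (hypothetical): immutable, compared by value."""
    value: str

    def __post_init__(self):
        # Assumed rule for this sketch: only absolute http(s) URLs are valid.
        if not self.value.startswith(("http://", "https://")):
            raise ValueError("not an absolute URL: %r" % self.value)


@dataclass(eq=False)
class Story:
    """Entity (hypothetical): identified by item_id, not by its fields."""
    item_id: int
    title: str
    url: Url
    points: Optional[int] = None

    def __eq__(self, other):
        # Two Story objects are the same entity if their ids match.
        return isinstance(other, Story) and other.item_id == self.item_id

    def __hash__(self):
        return hash(self.item_id)
```

Two `Url` objects with the same string compare equal, while two `Story` objects are equal whenever their `item_id` matches, even if the title was edited between scrapes.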

Repository Pattern

I use the repository pattern to delegate the job of gathering data to a separate class. The repository class is given a site, fetches the data, and pre-builds the entities if needed.
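A rough sketch of such a repository, assuming the paged Hacker News URL from the question. `PageRepository`, `get_page` and `http_get` are hypothetical names. The default fetcher uses the standard library's `urllib` rather than `requests` so the sketch has no third-party dependency; any callable that takes a URL and returns text can be passed in instead.

```python
from typing import Callable
from urllib.request import urlopen


def http_get(url: str) -> str:
    """Default fetcher (stdlib only): return the response body as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


class PageRepository:
    """Knows where the data lives and how to fetch it, nothing else."""

    def __init__(self, url_template: str,
                 fetch: Callable[[str], str] = http_get):
        # url_template must contain one %s placeholder for the page number.
        self._url_template = url_template
        self._fetch = fetch

    def get_page(self, page_no: int) -> str:
        return self._fetch(self._url_template % page_no)
```

Usage would be along the lines of `PageRepository('https://news.ycombinator.com/news?p=%s').get_page(1)`. Injecting the fetcher keeps the network out of unit tests.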

Transformer/Presenter pattern

After fetching the data from the repository, I pass the HTML to a presenter class. The presenter's job is to build my business entities and value objects from the given HTML string.
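As an illustration, a presenter for the question's case might look like the sketch below. It assumes the markup the question's code relies on (story links inside `td` cells with class `title`), and it uses the standard library's `html.parser` in place of BeautifulSoup only to stay dependency-free. `Story`, `StoryPresenter` and `present` are hypothetical names.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List


@dataclass(frozen=True)
class Story:
    """Minimal hypothetical entity for this sketch."""
    title: str
    url: str


class _TitleLinkParser(HTMLParser):
    """Collects (text, href) for each <a href> inside <td class="title">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_title_cell = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            classes = (attrs.get("class") or "").split()
            self._in_title_cell = "title" in classes
        elif tag == "a" and self._in_title_cell and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None
        elif tag == "td":
            self._in_title_cell = False


class StoryPresenter:
    """Turns a raw HTML string into Story objects; does no fetching."""

    def present(self, html: str) -> List[Story]:
        parser = _TitleLinkParser()
        parser.feed(html)
        return [Story(title=t, url=u) for t, u in parser.links]
```

All knowledge of the page layout sits in one class, so when the site changes its markup only the presenter needs to change.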

Service Layer

If there is more processing than described above, I write a service class that wraps the whole problem. It calls the repository and hands the fetched data to the presenter, which builds the entities. The result can then be used by another service, for example to store it in a SQL database.
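A sketch of such a service, under the assumption that the repository exposes a `get_page(page_no)` method and the presenter a `present(html)` method (both hypothetical names). The optional `store` callable stands in for whatever persists the result.

```python
from typing import Callable, Iterable, List, Optional


class ScrapeService:
    """Wires the pieces together: fetch, then present, then hand off."""

    def __init__(self, repository, presenter,
                 store: Optional[Callable[[List], None]] = None):
        self._repository = repository  # needs get_page(page_no) -> str
        self._presenter = presenter    # needs present(html) -> list
        self._store = store            # optional sink, e.g. a DB writer

    def run(self, pages: Iterable[int]) -> List:
        entities = []
        for page_no in pages:
            html = self._repository.get_page(page_no)
            entities.extend(self._presenter.present(html))
        if self._store is not None:
            self._store(entities)
        return entities
```

Because the service only depends on those two methods, each collaborator can be replaced by a fake in tests or swapped for a different site.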

If you are familiar with PHP, I have written a small Laravel app that fetches the Alexa rank of a given website every 15 minutes and notifies that website's subscribers by email.

Upvotes: 4
