Reputation: 5859
I have not done much web scraping. So far I am using Python with BeautifulSoup4 to scrape the Hacker News front page.
Are there any patterns I should keep in mind when writing a scraper? Right now the code looks very ugly and feels like a hack.
Code:
import requests
from bs4 import BeautifulSoup
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    page = {}        # scraped rows, keyed by a running row counter
    td_count = 2     # used to pick out every third row (the "subtext" row)
    data_count = 0

    def handle(self, *args, **options):
        # Scrape pages 1 to 3
        for i in range(1, 4):
            self.page_no = i
            self.parse()
        print(self.page[1])

    def get_result(self):
        return requests.get('https://news.ycombinator.com/news?p=%s' % self.page_no)

    def parse(self):
        soup = BeautifulSoup(self.get_result().text, 'html.parser')
        for x in soup.find_all('table')[2].find_all('tr'):
            self.data_count += 1
            self.page[self.data_count] = {'other_data': None, 'url': ''}
            if self.td_count % 3 == 0:
                # The subtext row belongs to the story in the previous row
                try:
                    subtext = x.find_all('td', 'subtext')[0]
                    self.page[self.data_count - 1]['other_data'] = subtext
                except IndexError:
                    pass
            title = x.find_all('td', 'title')
            if title:
                try:
                    self.page[self.data_count]['url'] = title[1].a
                    print(title[1].a)
                except IndexError:
                    print('Done page %s' % self.page_no)
            self.td_count += 1
Upvotes: 2
Views: 2575
Reputation: 617
I treat scrapable data as part of my domain (business) data, which lets me use Domain-Driven Design to structure the problem:
Entities and Value Objects
I use entities and value objects to hold the extracted information in my programming language's own data structures, so the rest of the code can work with it conveniently.
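As a rough illustration of that idea in Python (the language used in the question), here is a minimal sketch. The names `StoryUrl` and `Story` and their fields are my own assumptions for the example, not something taken from an existing library or from the Laravel app mentioned later.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoryUrl:
    """Value object (hypothetical): immutable and compared by its value."""
    value: str

    def __post_init__(self):
        # Reject anything that is not an absolute http(s) URL.
        if not self.value.startswith(("http://", "https://")):
            raise ValueError("not an absolute URL: %r" % self.value)


@dataclass(eq=False)
class Story:
    """Entity (hypothetical): identified by story_id, even if other fields change."""
    story_id: int
    title: str
    url: StoryUrl
    points: Optional[int] = None

    def __eq__(self, other):
        # Two Story objects are the same entity when their ids match.
        return isinstance(other, Story) and self.story_id == other.story_id

    def __hash__(self):
        return hash(self.story_id)
```

The point of the split is that the rest of the program deals with `Story` objects instead of raw tags or nested dicts, and invalid data is rejected at construction time.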
Repository Pattern
I use the repository pattern to delegate the job of gathering data to a separate class. The repository class is given a site, fetches the data, and pre-builds the entities if needed.
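A minimal sketch of such a repository in Python could look like the following. The class name `PageRepository`, its method names, and the injectable `fetch` function are assumptions made for this example. The default fetcher uses `requests.get`, as in the question, and the URL template is the one from the question's code.

```python
def fetch_with_requests(url):
    """Default fetcher: download a page and return its HTML as text."""
    import requests  # imported lazily so the repository can be tested without it
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


class PageRepository:
    """Hypothetical repository: the only place that knows how pages are fetched."""

    def __init__(self, url_template, fetch=fetch_with_requests):
        self._url_template = url_template  # e.g. 'https://news.ycombinator.com/news?p=%s'
        self._fetch = fetch

    def get_page_html(self, page_no):
        return self._fetch(self._url_template % page_no)

    def get_pages_html(self, first=1, last=3):
        # Inclusive page range, mirroring range(1, 4) in the question.
        return [self.get_page_html(n) for n in range(first, last + 1)]
```

Because the fetch function is passed in, the parsing code can be exercised with canned HTML and no network access, for example `PageRepository('https://news.ycombinator.com/news?p=%s', fetch=lambda url: '<html></html>')`.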
Transformer/Presenter pattern
After fetching the data from the repository, I pass the HTML to a presenter class. The presenter is responsible for creating my business entities and value objects from the given HTML string.
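Here is a hedged sketch of a presenter in Python. To keep it dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, and the same class could just as well wrap `soup.find_all('td', 'title')` as in the question. It assumes the markup the question's code relies on (story links inside `<td class="title">` cells), and the real page layout may differ. `Story`, `StoryPresenter`, and `present` are names invented for the example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Story:
    """Hypothetical entity built by the presenter."""
    title: str
    url: str


class _TitleLinkCollector(HTMLParser):
    """Collects (href, text) for each <a href=...> inside a <td class="title">."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_title_cell = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self._in_title_cell = "title" in (attrs.get("class") or "").split()
        elif tag == "a" and self._in_title_cell and self._href is None:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link we are collecting.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "td":
            self._in_title_cell = False


class StoryPresenter:
    """Turns an HTML string into a list of Story objects."""

    def present(self, html):
        collector = _TitleLinkCollector()
        collector.feed(html)
        return [Story(title=text, url=href) for href, text in collector.links]
```

All knowledge about the page's markup lives in this one class, so when the site changes its HTML only the presenter has to be updated.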
Service Layer
If there is more processing than described above, I write a service class that wraps the whole problem: it calls the repository, passes the fetched data to the presenter, and the presenter builds the entities. The result can then be handed to another service, for example to be stored in an SQL database.
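The wiring of such a service might be sketched in Python like this. `ScrapeService` and its parameters are assumptions for the example. It accepts any repository object with a `get_page_html(page_no)` method, any presenter object with a `present(html)` method, and an optional `store` callable that stands in for the service that saves results.

```python
class ScrapeService:
    """Hypothetical service: coordinates fetching, parsing, and storing."""

    def __init__(self, repository, presenter, store=None):
        self._repository = repository
        self._presenter = presenter
        self._store = store  # e.g. a function that writes entities to a database

    def run(self, page_numbers):
        entities = []
        for page_no in page_numbers:
            html = self._repository.get_page_html(page_no)
            entities.extend(self._presenter.present(html))
        if self._store is not None:
            self._store(entities)
        return entities


# Tiny stand-ins so the sketch can be run without a network or a real parser.
class FakeRepository:
    def get_page_html(self, page_no):
        return "story-%d-a,story-%d-b" % (page_no, page_no)


class FakePresenter:
    def present(self, html):
        return html.split(",")


saved = []
service = ScrapeService(FakeRepository(), FakePresenter(), store=saved.extend)
result = service.run(range(1, 3))
```

The service itself contains no fetching or parsing logic. It only decides the order of the steps, which keeps each collaborator replaceable and easy to test on its own.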
If you are familiar with PHP, I have written a small app in Laravel that fetches the Alexa rank of a given website every 15 minutes and notifies that website's subscribers by email.
Upvotes: 4