Jay Gattuso

Reputation: 4130

python - repeating functions in a class

First of all, please forgive me if I use the wrong terminology or I have a flaw in my understanding of the basic concepts. I'm just learning how to build scripts that use classes... and am a relative newbie to Python/coding.

In principle, I'm interested to know if there is a way of calling the same func twice in a class. If I briefly explain what I am trying to do it might make some sense...

I am trying to write a web 'scraper' that parses a seeded webpage, returns URLs based on some given parameters, follows those URLs, does the same thing (potentially n times) and finally returns a pdf at the bottom of the link. This is for speeding up some content collecting my colleagues currently do manually. (I have saved literal person months of manual effort with my relatively basic previous iterations.)

This is a method I currently use, but the code I wrote is not really scalable or easily reusable, and I want to make it more versatile (I'm currently hand-rolling the script for each instance).

(I think) I want to build a class called siteInstance that I use to hold the seed url, save locations, titles, u:p, cookie and the various funcs. that I use to walk through the site to get to the target content.

There is a repeated function I use that parses the target URL and returns the next layer of URLs. These are based on some site-specific regex, so I know I will have to supply the search filter (regex) for each layer of URLs. I would like to be able to reuse the parser, but feed it the layer-specific regex. DRY is a thing, right?

In my mind, this means I have a func called siteInstance.parser, that I construct a number of within each siteInstance.class (e.g. siteInstance.parserA for layer one, siteInstance.parserB for layer two ... siteInstance.parsern for the nth layer)

What follows is a simplified version. In reality there are a number of cleaning/preparation steps at each layer to generate the correct list of target URLs for the next layer. This includes creating a file structure for saved binaries, writing logs, firing the regex for that layer, etc.

This is a two-layer example, but I know of instances that have at least 4 layers to contend with.

Example: Seed {URL:www.journalTitle.com}

Result of first pass (Layer1): [{IssueURL2010:www.journalTitle.com/2010},{IssueURL2011:www.journalTitle.com/2011},{IssueURL2012:www.journalTitle.com/2012},{IssueURLn:www.journalTitle.com/n}]

For IssueURL2010 (Layer 2): [{article1_2010URL:www.journalTitle.com/2010/1},{article2_2010URL:www.journalTitle.com/2010/2},{article3_2010URL:www.journalTitle.com/2010/3},{articlen_2010URL:www.journalTitle.com/2010/n}]

From article1_2010URL I can get www.journalTitle.com/2010/1.pdf

I hope this makes sense...

Upvotes: 1

Views: 709

Answers (1)

nneonneo

Reputation: 179422

You could define the parsing logic in a separate class, and just instantiate it a few times in an instance (or class) attribute:

class URLParser(object):
    def __init__(self, regexp, ...):
        self.regexp = regexp
        ...

    def parse_urls(self, urls):
        # do your URL parsing thing
        # return parsed URLs

class SiteInstance(object):
    def __init__(self, ...):
        self.parsers = [
            URLParser('regexp1'),
            URLParser('regexp2'),
            ...
        ]

    def parse(self, ...):
        ...
        for parser in self.parsers:
            parser.parse_urls(...)
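
To make the outline above concrete, here is a minimal runnable sketch of the same idea. It is only one possible way to fill in the blanks, not a definitive implementation. The assumptions are mine: each layer's regex has exactly one capture group holding the link, the output of one layer is fed as input to the next, and pages are fetched through a `fetch` callable that I made up so the example can run against an in-memory dict instead of a real site. The host `journal.example.com`, the `PAGES` dict, and both regexes are hypothetical stand-ins for your real data.

```python
import re
from urllib.parse import urljoin


class URLParser:
    """Extracts the next layer of URLs from a page using one layer-specific regex."""

    def __init__(self, regexp):
        # Assumption: the pattern has exactly one capture group containing the link.
        self.regexp = re.compile(regexp)

    def parse_urls(self, base_url, page_text):
        # Resolve each captured (possibly relative) link against the page's own URL.
        return [urljoin(base_url, link) for link in self.regexp.findall(page_text)]


class SiteInstance:
    def __init__(self, seed_url, layer_regexps, fetch):
        self.seed_url = seed_url
        # One parser per layer, all sharing the same parsing code.
        self.parsers = [URLParser(regexp) for regexp in layer_regexps]
        # Hypothetical hook: a callable mapping url -> page text.
        self.fetch = fetch

    def parse(self):
        urls = [self.seed_url]
        # Walk the layers in order; each layer's results seed the next layer.
        for parser in self.parsers:
            next_urls = []
            for url in urls:
                next_urls.extend(parser.parse_urls(url, self.fetch(url)))
            urls = next_urls
        return urls


# In-memory stand-in for a site (made-up content, two layers deep).
PAGES = {
    "http://journal.example.com/": '<a href="/2010">2010</a> <a href="/2011">2011</a>',
    "http://journal.example.com/2010": '<a href="/2010/1">a1</a> <a href="/2010/2">a2</a>',
    "http://journal.example.com/2011": '<a href="/2011/1">a1</a>',
}

site = SiteInstance(
    seed_url="http://journal.example.com/",
    layer_regexps=[r'href="(/\d{4})"', r'href="(/\d{4}/\d+)"'],
    fetch=PAGES.__getitem__,
)
article_urls = site.parse()
# Following the question's example, the PDF lives at the article URL plus ".pdf".
pdf_urls = [url + ".pdf" for url in article_urls]
```

Adding a third or fourth layer is then just one more entry in `layer_regexps`. Per-layer cleanup, logging, or directory creation could live in `URLParser` (or a subclass of it) without touching the loop in `SiteInstance.parse`.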

Upvotes: 2
