Ochmar

Reputation: 294

Django, scraping: What's the best way to detect "changes" while scraping?

This is not a typical code problem; instead, it's a design problem I'm facing right now.


Let's say there is a webpage (which is not mine) from which I'd like to scrape a few pieces of information. The most important information, for me, is when (datetime) a character logged in and when he logged off, but I collect other data as well. The login time is known from point 2 (see below), but the logout time I have to calculate myself. I can access 2 pages (a fetch sketch follows the list):

  1. http://x/online.php - It gives me a list of online nicknames (200 - 500 entries).
  2. http://x/character.php?name=nickname - It gives me the details of each nickname: Character name, Guild, Sex, Level, Class (Vocation), Status (offline/online), Last login.
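
Roughly, the fetching side looks something like the sketch below; the URLs are the two pages above, but the td.name selector and the parsing are placeholders, since they depend on the page's real markup:

import requests
from bs4 import BeautifulSoup

BASE = "http://x"  # placeholder domain from the URLs above

def fetch_online_names():
    # Page 1: the list of currently online nicknames (200 - 500 entries).
    html = requests.get(BASE + "/online.php", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selector; the real markup will differ.
    return {cell.get_text(strip=True) for cell in soup.select("td.name")}

def fetch_character(name):
    # Page 2: per-character details (name, guild, sex, level,
    # vocation, status, last login).
    html = requests.get(BASE + "/character.php", params={"name": name}, timeout=10).text
    return BeautifulSoup(html, "html.parser")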

I make only 2 "operations" in tasks.py: refreshing the online list and updating the characters' details.

So, how it works now is that, each minute, thanks to Celery, I fetch the online list, update the details of the listed characters, and calculate the logout time of anyone who disappeared from the list (a simplified sketch follows).
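
A simplified sketch of that minute task, using the hypothetical fetch_online_names() helper above (not my exact code):

from celery import shared_task
from django.utils import timezone

from .models import OnlineDetails, Player

@shared_task
def refresh_online_players():
    # Scheduled every minute via celery beat.
    now = timezone.now()
    online_now = fetch_online_names()  # hypothetical helper from above

    # Anyone with an open session who is no longer listed has logged out.
    OnlineDetails.objects.filter(logout__isnull=True).exclude(
        player__name__in=online_now
    ).update(logout=now)

    # Open a session for every known player who just appeared.
    for player in Player.objects.filter(name__in=online_now):
        if not OnlineDetails.objects.filter(player=player, logout__isnull=True).exists():
            OnlineDetails.objects.create(player=player, login=now)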


And the problem is that I'm unsure whether that's a good idea. My models.py (comments are just to clarify what I'm doing):

class Guild(models.Model):
     name = models.CharField(max_length=100)

class Player(models.Model):
    # FK: the player's guild, if he has one
    guild = models.ForeignKey(Guild, null=True, blank=True, on_delete=models.SET_NULL)
    name = models.CharField(max_length=100, unique=True) 
    sex = models.CharField(choices=SEX_CHOICES, max_length=7) # Male / Female
    level = models.PositiveSmallIntegerField() 
    vocation = models.CharField(choices=VOCATION_CHOICES, max_length=50) # His class
    status = models.CharField(choices=ONLINE_CHOICES, max_length=10) # Offline / Online
    lastlogin = models.DateTimeField() 

    def __str__(self):
        return self.name


class Deaths(models.Model):
    text = models.CharField(max_length=500)
    killed = models.ForeignKey(Player, null=True, on_delete=models.CASCADE, related_name='killed') # Who got killed
    killer = models.ForeignKey(Player, null=True, on_delete=models.CASCADE, related_name='killer') # Who killed him

    date = models.DateTimeField() # When he died
    level = models.PositiveSmallIntegerField() # The level the player died at
    pvp = models.BooleanField() # Was the death due to PvP or PvE?

    class Meta:
        ordering = ('date',)


class OnlineDetails(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE)

    login = models.DateTimeField() # When he logged in
    logout = models.DateTimeField(null=True, blank=True) # When he logged off

    def __str__(self):
        return f"{self.player.name} {self.logout}" if self.logout else self.player.name

    class Meta:
        ordering = ('logout', 'login')
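
For what it's worth, session history reads back naturally from these models, e.g.:

# All finished sessions of one player, plus their durations
# ("SomeNick" is a made-up nickname):
player = Player.objects.get(name="SomeNick")
for d in player.onlinedetails_set.filter(logout__isnull=False):
    print(d.login, d.logout, d.logout - d.login)  # duration is a timedelta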

It works, but I was wondering if it's the best way to do it. Actually, I think this way is bad, because I have to scan over ~500 characters in one minute, which makes it hard to get past the site's anti-DDoS shield.

Do you have any better solution, or a technology I should pick up? I'm not the best at Python or Django; I'm still learning.

Upvotes: 1

Views: 212

Answers (1)

Kryštof Řeháček

Reputation: 2483

Sure, you can measure the whole process, how long it takes and so on, but I think updating ~500 entries takes a few milliseconds. The bigger problem could be scraping 500 entries every minute: that means you have to send about ~8 requests per second (based on point 2, not point 1). I think you are scraping point 1 every minute and, on change, scraping the missing characters. Point 1 is not a problem at all. Parsing so many pages could be hard, but not impossible.

Also, I suggest you download the pages and store them for some period of time, in case anything fails during the process. It's also faster to download the pages and parse them in parallel in another thread, because the most time-consuming part is sending the request and downloading the response (sketched below).

As for the transaction autocommit: it could be a problem in a multithreaded environment. You should measure the process with and without it, to see whether it's worth the risk of not knowing what is happening.
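
For example, a rough sketch of that download-then-parse split (the helper names and the worker count are made up; tune max_workers to what the site tolerates):

import concurrent.futures

import requests

def download(url):
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def scrape_pages(urls, parse):
    # Threads help here because the time goes into network I/O,
    # not into Python code.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(download, urls))
    # Store the raw pages (disk / DB) before parsing, so a parsing bug
    # doesn't force you to re-download everything.
    return [parse(html) for html in pages]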

Upvotes: 1
