Reputation: 563
I am working on a Google App Engine project (python/webapp2) and I am a little concerned about people abusing/spamming the service I am creating with a large number of requests. To combat this potential abuse, my idea is to limit the number of requests allowed per IP address in any given hour for certain parts of the application. My current plan is as follows:
On each request I will:

1. Query the datastore for the number of requests recorded for that IP address within the past hour.
2. If that count exceeds the limit, reject the request.
3. Otherwise, store a new record of the request (IP address plus timestamp) in the datastore and serve the response.
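Roughly, the check would look something like this (the RequestLog model and the limit constant are just placeholder names I'm using for illustration):

from datetime import datetime, timedelta
from google.appengine.ext import ndb

MAX_REQUESTS_PER_HOUR = 200  # placeholder limit

class RequestLog(ndb.Model):
    # One entity per request: the IP plus when it happened.
    ip = ndb.StringProperty()
    timestamp = ndb.DateTimeProperty(auto_now_add=True)

def allow_request(ip):
    cutoff = datetime.utcnow() - timedelta(hours=1)
    recent = RequestLog.query(RequestLog.ip == ip,
                              RequestLog.timestamp > cutoff).count()
    if recent >= MAX_REQUESTS_PER_HOUR:
        return False
    RequestLog(ip=ip).put()  # record this request
    return True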
My question is this:
Is this the best way to go about it? I am only a beginner here, and I imagine there is quite a bit of overhead in doing it this way; possibly this is a common task that has a better-known solution. Is there any better way to do this that is less resource-intensive?
Upvotes: 3
Views: 4388
Reputation: 57188
In the past, I've done this with memcache, which is much faster, especially since you only really care about approximate limits (approximate because memcache can be flushed by the system, might not be shared by all instances, etc.). You can even use it to expire keys for you. Something like this, which assumes self is a webapp2 request handler:
import logging
from google.appengine.api import memcache

MAX_REQUESTS = 200           # tune to taste
WINDOW_IN_SECONDS = 60 * 60  # e.g. one hour

memcache_key = 'request-count-' + self.request.remote_addr
count = memcache.get(memcache_key)
if count is not None and count > MAX_REQUESTS:
    logging.warning("Remote user has %d requests; rejecting.", count)
    self.error(429)
    return
count = memcache.incr(memcache_key)
if count is None:
    # Key didn't exist yet; create it with an expiry so the count
    # resets once the window elapses.
    memcache.add(memcache_key, 1, time=WINDOW_IN_SECONDS)
This rejects a given user after roughly MAX_REQUESTS requests within WINDOW_IN_SECONDS, re-zeroing the count every WINDOW_IN_SECONDS. (I.e., it's not a sliding window; the count resets to zero each time period.)
Upvotes: 11
Reputation: 366073
First, two caveats with your design:
It's often very easy for someone to get a new IP address: switch your iPhone from LTE to 3G and back, unplug and replug your DSL modem, pick a new open proxy, etc. So, if you're expecting this to prevent intentional abuse rather than just people not realizing they're doing too much, it's not much help.
IP addresses are often shared, either by NAT, or sequentially. Maybe 200 requests per hour per IP seems reasonable if that means one person—but what if it means all 7500 employees at BigCorp's regional office?
Anyway, your solution will work, and, depending on your traffic patterns, it may be reasonable, but there are a few alternatives.
For example, instead of checking on every connection, you may want to keep a shared blacklist. When a connection comes in, immediately accept or reject based on that blacklist, and kick off an "update the database" job. You can do further tricks to coalesce the updates, not update more often than once every N seconds, etc. Of course this means you now have shared data that's readable by all connections and writable by some background job, which means you've opened the door to race conditions and deadlocks and all the fun things that Guido tried hard to make sure you rarely have to face with GAE.
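A rough sketch of that shape, assuming the blacklist lives in memcache and the background work goes through GAE's deferred library (all names here are made up for illustration):

from google.appengine.api import memcache
from google.appengine.ext import deferred

BLACKLIST_KEY = 'ip-blacklist'  # illustrative key name

def is_blacklisted(ip):
    # The hot path: one shared memcache read per connection.
    blacklist = memcache.get(BLACKLIST_KEY)
    return blacklist is not None and ip in blacklist

def update_blacklist(ip):
    # Background job: re-count this IP's recent requests in the
    # datastore and rebuild the shared blacklist if it's over the
    # limit. The counting, coalescing, and race handling warned
    # about above would all live here.
    pass

# In the request handler:
# if is_blacklisted(self.request.remote_addr):
#     self.error(429)
#     return
# deferred.defer(update_blacklist, self.request.remote_addr)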
You can use memcache instead of the datastore. However, you need to carefully rework your keys so they make sense for a simple key-value store and so expiry does what you want. For example, you might keep a value keyed off the IP plus a timestamp or random number or whatever for each connection, plus a list-of-connections value keyed off the IP that lets you find the other values. Any value that's dropped out of the cache no longer counts, and if the list-of-connections value drops, the user must be down to 0. But this adds a lot of complexity.
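For concreteness, that key scheme might look like the following sketch (key prefixes and helper names are invented here; note that the read-modify-write on the list value is itself racy, which is part of the complexity being described):

import time
import uuid
from google.appengine.api import memcache

WINDOW_IN_SECONDS = 60 * 60  # illustrative window

def record_connection(ip):
    # One short-lived value per connection, keyed by IP plus a nonce...
    conn_key = 'conn-%s-%s' % (ip, uuid.uuid4().hex)
    memcache.add(conn_key, time.time(), time=WINDOW_IN_SECONDS)
    # ...plus a list-of-connections value keyed off the IP alone, so
    # the per-connection keys can be found again later.
    conn_list = memcache.get('conns-' + ip) or []
    conn_list.append(conn_key)
    memcache.set('conns-' + ip, conn_list, time=WINDOW_IN_SECONDS)

def count_connections(ip):
    # Only per-connection keys that haven't expired (or been evicted)
    # still count; if the list itself dropped out, the count is 0.
    conn_list = memcache.get('conns-' + ip) or []
    return len(memcache.get_multi(conn_list))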
If you have a small number of users each making a whole lot of requests, you could use a timer to decrement, reset, or re-count for each IP. However, if you expect more than a few hundred distinct IPs per hour, you'd need to manually coalesce all these timers, and probably coalesce the jobs as well (e.g., "at 17:55:39, decrement this list of 17 IPs"), and the timer will end up firing so often that it's probably not worth it.
Personally, I'd do the simplest implementation first, then stress-test and performance-test it, and if it's good enough, stop worrying.
And if it's not good enough, I might look into whether I could simplify the design before looking at optimizing the implementation. For example, if it's N connections per IP per calendar hour, that makes everything a whole lot easier: just store a counter per IP (in the datastore or memcache) and wipe all the counters at every XX:00, as sketched below. Is that acceptable?
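A sketch of that simpler design, with one twist: instead of an explicit wipe job at XX:00, bake the current hour into the key, so each hour's counter starts fresh and old ones simply expire (names are illustrative):

import time
from google.appengine.api import memcache

MAX_REQUESTS = 200  # illustrative limit

def over_limit(ip):
    # The key changes at the top of every hour, which is what
    # effectively "wipes" the old counters.
    hour = time.strftime('%Y%m%d%H', time.gmtime())
    key = 'hour-count-%s-%s' % (ip, hour)
    count = memcache.incr(key)
    if count is None:
        # First request from this IP this hour; let the key expire
        # well after the hour ends so stale counters clean themselves up.
        memcache.add(key, 1, time=2 * 60 * 60)
        count = 1
    return count > MAX_REQUESTS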
Upvotes: 2