Reputation: 71004
The database schema is
CREATE TABLE sites
(
site_id INTEGER PRIMARY KEY AUTOINCREMENT,
netloc TEXT UNIQUE NOT NULL,
last_visited REAL DEFAULT 0,
crawl_rate REAL DEFAULT 2,
crawl_frequency REAL DEFAULT 604800,
robots_txt TEXT DEFAULT 0,
robots_last_fetch REAL DEFAULT 0,
allow_all NUMERIC DEFAULT 0,
disallow_all NUMERIC DEFAULT 0,
active NUMERIC DEFAULT 0
)
CREATE TABLE urls
(
url_id INTEGER PRIMARY KEY AUTOINCREMENT,
site_id INTEGER REFERENCES sites (site_id) NOT NULL,
scheme TEXT NOT NULL,
path TEXT NOT NULL,
last_visited REAL DEFAULT 0,
UNIQUE( site_id, scheme, path)
)
As you can probably guess, this is for a web crawler.
I want to get N of the sites that have crawlable urls associated with them, along with all of those urls. A url is crawlable if url.last_visited + site.crawl_frequency < current_time, where current_time comes from Python's time.time() function. What I'm looking for will probably begin with something like:
SELECT s.*, u.* FROM sites s JOIN urls u ON s.site_id = u.site_id ...
Beyond that, all I can think is that GROUP BY might have some role to play.
Upvotes: 0
Views: 193
Reputation: 5183
Here is a graceless query. There's probably a more clever way to do this.
SELECT s.*, u.*
FROM sites s INNER JOIN urls u ON s.site_id = u.site_id
WHERE s.site_id IN
(SELECT DISTINCT uu.site_id
FROM urls uu INNER JOIN sites ss ON uu.site_id = ss.site_id
WHERE uu.last_visited + ss.crawl_frequency < current_time
ORDER BY ss.site_id
LIMIT n);
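In case it helps, here is a rough sketch of how the placeholders might be bound from Python's sqlite3 module, with time.time() supplying current_time as described in the question. The :current_time and :n parameters stand in for the bare current_time and n above, the sites_to_crawl helper and the crawler.db filename are just mine, and I've dropped the optional ORDER BY:

import sqlite3
import time

SITES_WITH_CRAWLABLE_URLS = """
SELECT s.*, u.*
FROM sites s
INNER JOIN urls u ON s.site_id = u.site_id
WHERE s.site_id IN
(SELECT DISTINCT uu.site_id
 FROM urls uu
 INNER JOIN sites ss ON uu.site_id = ss.site_id
 WHERE uu.last_visited + ss.crawl_frequency < :current_time
 LIMIT :n)
"""

def sites_to_crawl(conn, n):
    # Bind time.time() as a parameter; a bare current_time in the SQL
    # would be parsed by SQLite as the CURRENT_TIME keyword instead.
    params = {"current_time": time.time(), "n": n}
    return conn.execute(SITES_WITH_CRAWLABLE_URLS, params).fetchall()

# e.g. rows = sites_to_crawl(sqlite3.connect("crawler.db"), 10)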
The subquery is supposed to return up to n distinct site_ids with at least one crawlable URL. The ORDER BY attribute needn't be site_id. Actually, ORDER BY isn't necessary at all; I just threw it in there because consistency is nice when playing with a new query.
The enclosing query returns all urls associated with n distinct sites, where each site has at least one crawlable url. Note that not all urls returned are necessarily crawlable; the only guarantee is that at least one url per site is crawlable. A returned site could have non-crawlable urls, too.
If only crawlable urls should be returned, the timing condition can be copied into the enclosing query. I couldn't tell which behavior was required from the question.
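For the crawlable-only behavior, the same condition can be repeated in the outer WHERE clause; a sketch, parameterized the same way as above:

CRAWLABLE_URLS_ONLY = """
SELECT s.*, u.*
FROM sites s
INNER JOIN urls u ON s.site_id = u.site_id
WHERE u.last_visited + s.crawl_frequency < :current_time
AND s.site_id IN
(SELECT DISTINCT uu.site_id
 FROM urls uu
 INNER JOIN sites ss ON uu.site_id = ss.site_id
 WHERE uu.last_visited + ss.crawl_frequency < :current_time
 LIMIT :n)
"""

It binds exactly like the query above; the IN subquery still picks at most n sites, while the outer condition drops their non-crawlable urls.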
P.S. I'm indulging in pedantry now, but the way crawl_frequency is used makes me think it could be called crawl_period or crawl_delay instead.
Upvotes: 1