user2493235

Data Structure/Algo for a Queue with Dynamic Limits Based on Attribute

I have a queue (implemented as an array) that contains HTTP requests to be made as sockets become available, with the number of active sockets being limited. I would like to expand this so that there is a maximum number of sockets per host (and a maximum number of total sockets, but that doesn't particularly come into my question here).

So the queue should continue to be processed in the order the requests were received. But of course, if the maximum number of sockets has been reached for the host of the next request in the queue, that request can't be serviced, so the next request in the queue whose host has not reached its maximum should be taken instead.

I looked at using a Priority Queue, with a comparator to check for the max sockets available on the hosts, but this doesn't really do the job. I want to take the next one in the queue that can be serviced, not reorder the queue based on socket availability as a priority metric.

I thought about a queue per host, but then it's difficult to maintain the original order.

I'm thinking of having a single queue, with a host attribute on each item, and a routine that goes through the queue until it finds the first item whose host has available sockets, and then dequeues it by splicing. This maintains the original order, but it seems like it could be inefficient.
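
For example, a minimal sketch of that routine (hasSocketAvailable(host) being a hypothetical helper that checks the per-host limit, not something I already have):

function dequeueNextServiceable(queue, hasSocketAvailable) {
  // Walk the single FIFO queue and take the first request whose host
  // still has a free socket; splice keeps the rest in original order.
  for (let i = 0; i < queue.length; i++) {
    if (hasSocketAvailable(queue[i].host)) {
      return queue.splice(i, 1)[0];
    }
  }
  return null; // nothing can be serviced right now
}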

So I'm thinking of combining the approaches with something like this (maintaining the overall order with an "order" attribute):

const queues = [
  {
    host: 'www.example.org',
    queue: [
      { order: 1 },
      { order: 3 }
    ]
  },
  {
    host: 'www.example.com',
    queue: [
      { order: 2 },
      { order: 4 },
      { order: 5 }
    ]
  }
];

With the above approach, an order attribute would be added to each request as it is added to the appropriate queue for its host. Then, each time a new item is needed, the set of host queues can be sorted based on the order values of their first items. That way, the check for the next item only needs to run once per host, instead of scanning the entire queue each time.
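
A sketch of that lookup (again with a hypothetical hasSocketAvailable(host) helper), where a single pass for the smallest head order is enough rather than a full sort:

function dequeueNext(queues, hasSocketAvailable) {
  // Consider only hosts that still have a free socket, and pick the
  // per-host queue whose head has the lowest `order` value.
  let best = null;
  for (const q of queues) {
    if (q.queue.length === 0 || !hasSocketAvailable(q.host)) continue;
    if (best === null || q.queue[0].order < best.queue[0].order) {
      best = q;
    }
  }
  return best ? best.queue.shift() : null; // null if nothing is serviceable
}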

Upvotes: 0

Views: 59

Answers (1)

Jim Mischel

Reputation: 134015

I did something like this in the past for a web crawler.

I had a Host class that contained information about the host: name, maximum number of concurrent requests, the current number of active requests, a copy of its robots.txt file, statistics about its history (number of requests I made to it, average response speed, error rate, etc.), and other host-specific information.
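
A rough sketch of that Host record in JavaScript (field names are illustrative, not the original code):

class Host {
  constructor(name, maxConcurrentRequests) {
    this.name = name;
    this.maxConcurrentRequests = maxConcurrentRequests;
    this.activeRequests = 0;
    this.robotsTxt = null;       // cached copy of the host's robots.txt
    this.totalRequests = 0;      // history/statistics
    this.averageResponseMs = 0;
    this.errorRate = 0;
  }

  hasSocketAvailable() {
    return this.activeRequests < this.maxConcurrentRequests;
  }
}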

I also had a priority queue of requests. Each request structure had the URL to be visited, and a reference to the corresponding Host instance. The priority key was a combination of a priority value based on the URL's value (computed by a machine learning algorithm that isn't really relevant here), and the time.
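
A request entry along those lines might be sketched as follows (the URL-value score is just a placeholder here, and the exact way score and time were combined isn't shown):

function makeRequest(url, host, urlValueScore) {
  return {
    url,
    host,                   // reference to the corresponding Host instance
    score: urlValueScore,   // from the (out-of-scope) URL-value model
    notBefore: Date.now(),  // time component of the priority key
  };
}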

When I removed a request from the queue, the first thing I would do was check the Host to see if there were sockets available. If not, I would just re-queue the request with a time value of now + the host's average request time.
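
Roughly, reusing the sketches above and assuming a requestQueue with push/pop keyed on that (score, notBefore) priority:

function processNext(requestQueue, issueRequest) {
  const req = requestQueue.pop();
  if (!req) return;
  if (!req.host.hasSocketAvailable()) {
    // No free socket: push it back, delayed by the host's average request time.
    req.notBefore = Date.now() + req.host.averageResponseMs;
    requestQueue.push(req);
    return;
  }
  req.host.activeRequests++;
  issueRequest(req); // actually perform the fetch
}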

That worked, although URLs for some very busy hosts tended to get recycled often.

I experimented with a priority queue of hosts. Each host had a queue of URLs. There was also a timeout list: a dictionary of hosts that were currently in "timeout" for various reasons, mostly because they had no available sockets or their URL queues were empty. Here's how it worked:

A host would be removed from the priority queue, and a request made. If the host still had available sockets, I would add it back to the priority queue; if not, it went into the timeout list. When the request completed, the host's number of available sockets would be increased, and if the host was in the timeout list it would be removed from the list and re-inserted into the priority queue.
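
A sketch of that flow, with hostQueue as an assumed priority queue of Host objects (each carrying its own urls array for this variant) and a plain object as the timeout list:

const timedOut = {}; // hosts currently parked: no sockets or no URLs

function startNext(hostQueue, issueRequest) {
  const host = hostQueue.pop();
  if (!host) return;
  if (host.urls.length === 0) {
    timedOut[host.name] = host; // nothing to crawl for this host right now
    return;
  }
  const url = host.urls.shift();
  host.activeRequests++;
  if (host.hasSocketAvailable() && host.urls.length > 0) {
    hostQueue.push(host);       // still usable: keep it in the rotation
  } else {
    timedOut[host.name] = host; // out of sockets or out of URLs: park it
  }
  issueRequest(url, host, () => onRequestComplete(host, hostQueue));
}

function onRequestComplete(host, hostQueue) {
  host.activeRequests--; // a socket is free again
  if (timedOut[host.name]) {
    delete timedOut[host.name];
    hostQueue.push(host); // back into the priority queue
  }
}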

That approach looked promising. We were testing it when the project was canceled for other reasons.

Upvotes: 1
