El Moreno

Reputation: 425

NodeJS - Force users to wait until global event is completed

I have a Node server that does the following:

I keep a list of URLs on an external server; call it URLServer. When a user hits my Node server, my node server makes a request to URLServer and gets back a list of, say, 20 URLs. As soon as those 20 URLs arrive, I want my node server to get the title for each of them: I fetch each URL, create a DOM, and extract the title (I also extract other data, so this is the way it has to get done). Once I have done that, I want the titles and the URLs saved in internal memory and/or a database. So I have a URL-cache and a title-cache (I don't want to re-fetch the URLs all the time).

I have something like this:

    if (URL-cache is empty)
        get URLs from URLServer and cache these URLs

I then want to check each of those URLs to see if their titles are in my cache, so I do:

    for each URL
        if title-cache[URL] exists, good
        else fetch site, create DOM, extract title + other data, and cache it
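In Node, the two-level cache check described above might look something like the sketch below. It shows the flow that works for a single user (and breaks under concurrent load, as described next); `getUrlList` and `fetchTitle` are hypothetical stand-ins for the real calls to URLServer and to each site:

```javascript
// Minimal sketch of the two-level cache described above.
var urlCache = [];     // URLs fetched from URLServer
var titleCache = {};   // url -> title

function handleRequest(getUrlList, fetchTitle, done) {
  function fillTitles() {
    var pending = urlCache.length;
    if (pending === 0) return done(titleCache);
    urlCache.forEach(function (url) {
      if (titleCache[url]) {            // title already cached
        if (--pending === 0) done(titleCache);
      } else {
        fetchTitle(url, function (title) {
          titleCache[url] = title;      // cache the extracted title
          if (--pending === 0) done(titleCache);
        });
      }
    });
  }
  if (urlCache.length === 0) {
    getUrlList(function (urls) {        // cache miss: ask URLServer
      urlCache = urls;
      fillTitles();
    });
  } else {
    fillTitles();                       // URL list already cached
  }
}
```

Nothing here guards against a second request arriving while the first fetch is still in flight, which is exactly the problem described below.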

This works great for one user, but when I put the server under heavy load it hangs. I have concluded the server hangs for the following reason:

- User 1 Request - Empty caches, so fetch the URLs and, when done, fetch the content for each URL.
- User 2 Request - The caches still look empty to this user because the request for user 1 has not yet completed!!! Therefore, User 2 forces once again a fetch of the URLs and their respective content.
- User 3 Request - User 1 and User 2 requests are not yet completed, so the same issue...

So, assuming I have 10 URLs I need to fetch, instead of opening 10 connections, one per URL and then caching the data, if I have 20 users hitting the server at the exact same time, I will be opening 200 connections (each user opens 10 connections).

How can I block User X (where X > 1) from triggering these fetches? I basically want the server to close a gate and make every user wait until it has populated the caches, then open the gate once these are populated. Is there any way to do this?

Upvotes: 1

Views: 672

Answers (2)

DeadAlready

Reputation: 3008

This can be done by using the EventEmitter class. First you set up an EventEmitter:

    var events = require('events');
    var eventEmitter = new events.EventEmitter();

Then you handle your incoming requests

    // here you check for url in cache with your own logic
    if(weHaveUrl){
      // Respond directly
    } else {
      // Add one time event watcher for that url
      eventEmitter.once('url-' + url, function(data){
        // We now have data so respond
      });
      // Initiate search
      searchUrl(url);
    }

And wrap your search function to emit events

    var urlSearchList = [];
    function searchUrl(url){
      // We check in case we are already looking for the data
      if(urlSearchList.indexOf(url) === -1){
        // Append url to list so we won't start a second search
        urlSearchList.push(url);

        // Your logic for searching url data goes here.
        // Once received we emit the event, passing the data along
        eventEmitter.emit('url-' + url, data);
        // And optionally remove from the search array if we want
        //  to repeat the search at some point (note the deleteCount
        //  of 1 - without it splice removes everything from the
        //  index onwards)
        urlSearchList.splice(urlSearchList.indexOf(url), 1);
      }
    }

This method will answer a request either immediately, if the results are in the cache, or make it wait for the results from the search and then return them.

Since we keep a record of which searches have been initiated, we won't start searching for the same url many times, and every request will get a response as soon as the results become available.

Upvotes: 3

Sean Vieira

Reputation: 160043

The simplest way to avoid this problem (it's called the "thundering herd problem", by the way) is to not have any of the users run the fetchURLs code. Instead, if the cache check fails, add a job to your job queue to refresh this data. Then return a message that says something to the effect of "we're sorry, we don't have that data right now - please wait while we fetch it for you". The client then just polls your endpoint for the data, and once it is in the cache you are all set and ready to go.

In order to prevent the job from being submitted to the queue by 100 users add a flag to another globally available data structure (possibly the same one you are using for your job queue, but not necessarily). When you experience a cache miss check for the existence of the flag for that cache key and if it does not exist, set the flag and submit a job to your job queue. In pseudo-code:

    if url not in cache:
        if url not in jobLocks:
            jobLocks.add(url)
            jobQueue.add("fetchURLs", data=url)

        return "Please wait while we fetch your data"

    else:
        return cache[url]
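In Node terms, the same check-and-lock could be sketched as below; `jobLocks`, `jobQueue`, and `jobDone` are illustrative stand-ins here, not a specific queue library:

```javascript
var cache = {};     // url -> data
var jobLocks = {};  // urls that already have a refresh job queued
var jobQueue = [];  // stand-in for a real job queue

function handle(url) {
  if (cache[url] !== undefined) {
    return cache[url];                 // cache hit
  }
  if (!jobLocks[url]) {
    jobLocks[url] = true;              // set the flag before queueing
    jobQueue.push({ job: 'fetchURLs', data: url });
  }
  return 'Please wait while we fetch your data';
}

// Called by the worker once the fetch job finishes
function jobDone(url, data) {
  cache[url] = data;
  delete jobLocks[url];                // release the lock
}
```

Because the flag is checked and set before the job is submitted, 100 simultaneous cache misses for the same url still produce only one queued job.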

When the data in the cache goes stale you can use the same process to avoid a thundering herd on update. Instead of deleting the data and then re-fetching it, serve the stale data and put a job in the queue to update the cache.
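That serve-stale-while-refreshing variant might be sketched like this (the `TTL` value, entry shape, and function names are assumptions for illustration):

```javascript
var TTL = 60 * 1000;   // hypothetical freshness window, in ms
var cache = {};        // url -> { data, fetchedAt }
var jobLocks = {};     // urls with a refresh job already queued
var jobQueue = [];     // stand-in for a real job queue

function serveOrRefresh(url, now) {
  var entry = cache[url];
  if (!entry) {
    // Nothing at all yet: queue a fetch (at most once) and ask to wait
    if (!jobLocks[url]) {
      jobLocks[url] = true;
      jobQueue.push(url);
    }
    return 'Please wait while we fetch your data';
  }
  if (now - entry.fetchedAt > TTL && !jobLocks[url]) {
    jobLocks[url] = true;      // stale: refresh in the background...
    jobQueue.push(url);
  }
  return entry.data;           // ...but serve the stale copy right now
}
```

The key difference from a plain cache miss is that stale entries are still returned immediately, so no user ever waits on a refresh.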

Upvotes: 1
