Reputation: 307
I have a MEAN application that works as intended, Angular can pull data from my MongoDB, Express handles the API, etc.
I want to import data from an RSS feed into my database as new items are added to the feed. I originally had my app pull the JSON from the RSS feed when the page is loaded, but then every time the page is refreshed I add duplicate data from the feed. Is the best approach to keep pulling the feed on a page refresh and check whether the _id is already in the database? Or is there a better way to incorporate the consumption of RSS data into my database?
Here is my app structure:
Would something like this go into my /app/controllers/reviews.js?
var mongoose = require('mongoose');
var Review = require('../models/review');

// equivalent to "Create" in CRUD
exports.getAllFromFeed = function(req, res) {
    // pull RSS feed
    // create Review object from JSON
    // check for duplicate in database
    // add to mongodb
}
and then just call that on a page load?
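To be clear about the approach I am describing, this is roughly what I had in mind, just a sketch; the getFeedJson helper is a placeholder for whatever would actually pull and parse the feed:

// Rough sketch only; getFeedJson is a placeholder, not code I have yet
exports.getAllFromFeed = function(req, res) {
    getFeedJson(function(err, items) {
        if (err) return res.status(500).send(err);

        items.forEach(function(item) {
            // only insert when the id is not already in the collection
            Review.findById(item.id, function(err, existing) {
                if (!existing) {
                    Review.create(item, function(err, doc) {
                        if (err) console.log(err);
                    });
                }
            });
        });

        res.json({ ok: true });
    });
};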
Upvotes: 1
Views: 3329
Reputation: 151122
I had to hop through your other questions to understand what you are actually asking here, but your general case seems to come down to a couple of things: pulling the feed on a schedule rather than on every page load, and keeping duplicates out of the collection.
So basically the best thing to do here is manage the feed data being loaded into the collection via MongoDB "upserts", which only create new documents when something does not already exist. To do this, though, you will need to manipulate the content received from the feed a little, mainly to use whatever the unique identifier in the feed is as the default _id.
Here is the basic process with a few helpers in node:
var async = require('async'),
    time = require('time'),
    CronJob = require('cron').CronJob,
    mongoose = require('mongoose'),
    Schema = mongoose.Schema,
    FeedParser = require('feedparser'),
    request = require('request');

mongoose.connect('mongodb://localhost/test');

// strict: false keeps whatever fields the feed items provide;
// _id is a String so the feed's own identifier can be stored there directly
var feedSchema = new Schema({
    _id: String
},{ strict: false });

var Feed = mongoose.model('Feed',feedSchema);

var job = new CronJob({
    cronTime: '0 0-59 * * * *',   // every minute, just to demonstrate
    onTick: function() {

        var req = request('https://itunes.apple.com/us/rss/customerreviews/id=662900426/sortBy=mostRecent/xml'),
            feedparser = new FeedParser();

        var bulk = Feed.collection.initializeUnorderedBulkOp();

        req.on('error',function(err) {
            throw err;
        });

        req.on('response',function(res) {
            var stream = this;

            if (res.statusCode != 200) {
                return this.emit('error', new Error('Bad status code'));
            } else {
                console.log("res OK");
            }

            // feed the HTTP response into the parser
            stream.pipe(feedparser);
        });

        feedparser.on('error',function(err) {
            throw err;
        });

        feedparser.on('readable',function() {
            var stream = this,
                meta = this.meta,
                item;

            while ( item = stream.read() ) {
                // use the feed "guid" as the document _id so upserts match on it
                item._id = item.guid;
                delete item.guid;

                // queue the upsert; nothing is sent to the server yet
                bulk.find({ _id: item._id }).upsert().updateOne({ "$set": item });
            }
        });

        feedparser.on('end',function() {
            console.log('at end');

            // send all queued upserts to the server in one batch
            bulk.execute(function(err,response) {
                // Shouldn't be one as errors should be in the response
                // but just in case there was a problem connecting the op
                if (err) throw err;

                // Just dumping the response for demo purposes
                console.log( JSON.stringify( response, undefined, 4 ) );
            });
        });

    },
    start: true
});

// only start the job once the database connection is open
mongoose.connection.on('open',function() {
    job.start();
});
Some of the things I mentioned first: the schema definition here uses strict: false, largely because I don't want to specify all of the fields and would rather have mongoose handle them for me. There is a definition for _id as "String" though, so that the type is cast correctly for the "id" you are going to use from the feed data.
The general meat of this is set up in a "cron" job, using that node module. This sets up a periodic "job" to be run at the schedule that is specified. The timing I have used here is every minute, just to demonstrate.
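The schedule is the standard six field pattern used by the cron module, so if every minute is too aggressive the only thing to change is the cronTime value, for example:

cronTime: '0 0 * * * *'   // second 0 of minute 0, i.e. once per hour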
Other parts implement the "feedparser" module functionality, where a request is made for the content and the response is then piped through the feedparser to work it into data you can use. You can optionally set that part up externally, but it is kept inside the "job" definition here just as an example.
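If you did want to pull that fetch-and-parse step out of the job, a small helper along these lines would do. This is only a sketch; the fetchFeed name and callback shape are my own here, and the status code check from the full listing is omitted for brevity:

// Sketch of an externalized fetch-and-parse helper (names are illustrative)
function fetchFeed(url, callback) {
    var feedparser = new FeedParser(),
        items = [];

    request(url)
        .on('error', callback)
        .pipe(feedparser);

    feedparser.on('error', callback);

    feedparser.on('readable', function() {
        var item;
        while ( item = this.read() ) {
            items.push(item);
        }
    });

    feedparser.on('end', function() {
        callback(null, items);
    });
}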
For putting the data in MongoDB, I am using the Bulk operations API here. You don't have to, but it does give a clearer picture of what is happening via the write response you get later. Otherwise the general mongoose methods with "upsert" specified will do, such as .findByIdAndUpdate().
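For reference, the per item mongoose alternative would look something like this, assuming the same Feed model and the same "guid" to _id remapping used in the listing above. It is just a sketch; you lose the single bulk write response and pay one round trip per item:

// Per item upsert via mongoose instead of the Bulk API
Feed.findByIdAndUpdate(
    item._id,                 // the feed "guid" remapped onto _id
    { "$set": item },         // write only the fields from the feed item
    { "upsert": true },       // insert when no matching document exists
    function(err, doc) {
        if (err) throw err;
    }
);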
That is happening within the event fired when the parser stream is readable. Each .read() call returns the current "item" from the feed. In order to make everything happy, we are moving the original "guid" field of the feed data into the _id field. Then you just set up the "upsert" request. In Bulk operations this just "queues" the request at this point.
Finally at the end the bulk operations are executed, and thus sent to the server. Here we check the response to see what actually happened.
Outside of the definition for the "job", this is just wrapped in "starting" the job only when the connection to the database is available. Generally good practice to do, though if using the mongoose model methods for "upserts" then mongoose should "queue" operations until the connection is made anyway.
What happens now is this job should fire on startup as that is how it is defined and each minute the job will run again, requesting the feed data and "upserting" it. The actual output from the write response on a blank collection will be something like this on the first run:
{
    "ok": 1,
    "writeErrors": [],
    "writeConcernErrors": [],
    "nInserted": 0,
    "nUpserted": 51,
    "nMatched": 0,
    "nModified": 0,
    "nRemoved": 0,
    "upserted": [
        {
            "index": 0,
            "_id": "https://itunes.apple.com/us/app/cox-contour-for-ipad/id662900426?mt=8&uo=2"
        },
        {
            "index": 1,
            "_id": "1024220540"
        },
        {
            "index": 2,
            "_id": "1023922797"
        },
        {
            "index": 3,
            "_id": "1023784213"
        },
        {
            "index": 4,
            "_id": "1023592061"
        }
    ]
}
And so on for however many items are being returned in the feed, as these are newly inserted to the collection. But when the next "tick" runs:
{
    "ok": 1,
    "writeErrors": [],
    "writeConcernErrors": [],
    "nInserted": 0,
    "nUpserted": 0,
    "nMatched": 51,
    "nModified": 0,
    "nRemoved": 0,
    "upserted": []
}
As there was nothing new and nothing was actually changed, it just reports that the items were "matched" and does not actually do anything else to "modify" or "insert". MongoDB is generally smart enough to know this, as long as the $set operator is used as shown.
If something did actually change in the data from the feed it would be "modified" in the case of different data or "upserted" in the case of new items present in the feed.
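If you want to see that behaviour in isolation, you can queue the same upsert twice with the same Bulk API used above. The _id and field values here are just made up for the illustration:

// Illustration only: issue the same upsert repeatedly and watch the counters
var bulk = Feed.collection.initializeUnorderedBulkOp();

bulk.find({ "_id": "1023922797" })
    .upsert()
    .updateOne({ "$set": { "title": "Great app", "rating": "5" } });

bulk.execute(function(err, response) {
    if (err) throw err;
    // On an empty collection:              nUpserted: 1, nMatched: 0, nModified: 0
    // Run again with identical data:       nUpserted: 0, nMatched: 1, nModified: 0
    // Run again with a different "rating": nUpserted: 0, nMatched: 1, nModified: 1
    console.log( JSON.stringify( response, undefined, 4 ) );
});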
Alter as you need, but this gives you a way to set the import up to run periodically, and it avoids checking for the presence of every item in the database before deciding whether to insert it.
Upvotes: 4