Reputation: 1314

Is there a better solution than ActiveRecord for batch data imports?

I've developed a web interface for a legacy (vendor) database using Ruby on Rails. The database schema is a complete mess: more than 450 tables, with customer data spread over more than 20 of them, involving complex joins, etc.

I've got a good solution for this in the web app, and it works very well. But we also do nightly imports from external data sources (currently a view to a SQL Server DB and a SOAP feed) and they run SLOWLY: about 1.5-2.5 hours for the XML data import and about 4 hours for the DB import.

This is after doing some basic optimizations, which include manually starting the MRI garbage collector. And that right there suggests to me I'm Doing It Wrong. I've considered moving the nightly update/insert tasks out of the main Rails app and trying to use either JRuby or Rubinius to take advantage of the better concurrency and garbage collection.

My question is this: I know ActiveRecord isn't really designed for this type of task. But out of the O/RM options for Ruby (my preferred language), it seems to have the best Oracle support.

What would you do? Stick with AR and use a different interpreter? Would that really help? What about DataMapper or Sequel? Is there a better way of doing this?

I'm open to using Scala or Clojure if there's a better alternative (not limited to, but these are the other languages I'm playing with right now)... but what I don't want is something like DBI where I'm writing straight SQL, if for no other reason than that vendor updates occasionally change the DB schema, and I'd rather change a couple of classes than hundreds of UPDATE or INSERT statements.

Hopefully this question isn't 'too vague,' but I could really use some advice about this issue.

FWIW, Ruby is 1.9.2, Rails is 3.0.7, platform is OS X Server Snow Leopard (or optionally Debian 6.0).

Upvotes: 4

Answers (3)

mike

Reputation: 431

Edit: OK, I just realized that this solution will not work for Oracle, sorry.

You should really check out ActiveRecord-Import. It is easy to use and handles bulk imports with a minimal number of SQL statements. I saw a speedup from 5 hours to 2 minutes, and it will still run validations on the data.

From the GitHub page:

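# Build the records in memory first; nothing is written yet
books = []
10.times do |i|
  books << Book.new(:name => "book #{i}")
end
# Hand the whole array to activerecord-import in a single call
Book.import books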

https://github.com/zdennis/activerecord-import

Upvotes: 3

RHSeeger

Reputation: 16282

From my experience, ORMs are a great tool to use on the front end, where you're mostly just reading the data or updating a single row at a time. On the back end, where you're ingesting lots of data at a time, they can cause problems because of the way they tend to interact with the database.

As an example, assume you have a Person object that has a long list of Friends (let's say 100 for now). You create the Person object, assign 100 Friends to it, and then save it to the database. It's common for the naive use of an ORM to do 101 writes to the database (one for each Friend, one for the Person). If you were to do this in pure SQL at a lower level, you'd do 2 writes: one for the Person and then one for all the Friends at once (an insert with 100 actual rows). The difference between the two approaches is significant.
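
To make the difference concrete, here is a minimal sketch of the "one statement for all the Friends" idea. The `multi_row_insert` helper and the `friends` table with its `person_id` and `name` columns are hypothetical. The multi-row VALUES syntax is also an assumption about the target database: as far as I know, Oracle has traditionally needed a different form such as INSERT ALL, so treat this as an illustration of the shape rather than something to run against the vendor schema.

```ruby
# Hypothetical helper: builds ONE parameterized INSERT covering many rows.
# Table and column names are illustrative, not taken from any real schema.
def multi_row_insert(table, columns, rows)
  raise ArgumentError, "no rows given" if rows.empty?

  # One "(?, ?)" group per row, with one placeholder per column
  row_placeholder = "(" + (["?"] * columns.size).join(", ") + ")"
  sql = "INSERT INTO #{table} (#{columns.join(', ')}) VALUES " +
        ([row_placeholder] * rows.size).join(", ")

  # Bind values are returned separately, so nothing is interpolated into the SQL
  [sql, rows.flatten(1)]
end

friends = (1..100).map { |i| [1, "friend #{i}"] }
sql, binds = multi_row_insert("friends", %w[person_id name], friends)

puts sql.scan("(?, ?)").size  # number of rows in the single statement
puts binds.size               # number of bind values to pass along with it
```

The statement and its bind values would then go to the database in one call, instead of 100 separate Friend inserts.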

There are a couple of ways I've seen to work around the problem.

- Use a lower level database API that lets you write your "insert 100 friends in a single call" type command.
- Use an ORM that lets you write lower level SQL in order to do the Friends insert as a single SQL command (not all of them allow this and I don't know if Rails does).
- Use an ORM that lets you batch writes into a single database call. It's still 101 writes to the database, but it allows the ORM to batch them into a single network call to the database and say "do these 101 things". I'm not sure which ORMs allow for this.
- There are probably other ways.

The main point is that using the ORM to ingest any sizable amount of data can run into efficiency problems. Understanding what the ORM is doing under the hood (asking it to log all db calls is a good way to see this) is the best first step. Once you know what it's doing, you can look for ways to tell it "what I'm doing doesn't fit well into the normal pattern, let's change how you're being used"... and, should it not have a way that works, you can look at using a lower level API to allow for it.

I'll point out one other thing you can look at with a STRONG caveat that it should be one of the last things you consider. When inserting rows into the database in bulk, you can create a raw text file with all the data (format depends on the db, but the concept is similar to a CSV file) and give the file to the database to import in bulk. It's a bad way to go in almost every case, but I wanted to include it because it does exist as an option.
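
A rough sketch of the "hand the database a flat file" route, covering only the file-writing half. The `write_bulk_file` helper and the sample rows are made up for illustration. The loading step is database-specific (Oracle's tool for it is SQL*Loader, MySQL has LOAD DATA INFILE) and is deliberately not shown here.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: dumps rows to a CSV file that a bulk loader could read.
# The CSV library handles quoting of fields that contain commas or quotes.
def write_bulk_file(rows, path)
  CSV.open(path, "w") do |csv|
    rows.each { |row| csv << row }
  end
  path
end

# Sample data, invented for this example
rows = [[1, "Ada", "O'Neil"], [2, "Bob", "Smith, Jr."]]

file = Tempfile.new(["customers", ".csv"])
write_bulk_file(rows, file.path)
puts File.read(file.path)
```

Whether this is worth the trouble depends on how much transformation the data needs before loading; the file bypasses any validations the ORM would normally run.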

Edit: As a side note, the comment about more efficiently parsing the XML is a good thing to look at too. Using SAX vs DOM, or a different XML library, can be a huge win in time to completion. In some cases, it can be an even bigger win than more efficient database interaction. For example, you may be parsing a LOT of XML with lots of small pieces of data, and then only use small parts of it. In a case like that, the parsing could take a long time via DOM while SAX could ignore the parts you don't need... or it could be using a lot of memory creating DOM objects and slow down the whole thing due to garbage collection, etc. At the very least, it's worth looking at.
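
As an illustration of the SAX-style approach, here is a small sketch using REXML's stream parser from the Ruby standard distribution. The `CustomerNameListener` class and the XML layout are invented for the example; the real feed will look different, and a faster library may be a better fit for large files. The point is only that the listener picks out the `name` elements and never builds objects for anything else.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Hypothetical listener: collects the text of <name> elements and
# ignores every other part of the document.
class CustomerNameListener
  include REXML::StreamListener
  attr_reader :names

  def initialize
    @names = []
    @in_name = false
  end

  def tag_start(name, _attrs)
    return unless name == "name"
    @in_name = true
    @names << String.new
  end

  # Text may arrive in several pieces, so append rather than assign
  def text(value)
    @names.last << value if @in_name
  end

  def tag_end(name)
    @in_name = false if name == "name"
  end
end

# Sample document, made up for this example
xml = <<~XML
  <customers>
    <customer><name>Ada</name><notes>ignored</notes></customer>
    <customer><name>Bob</name><notes>also ignored</notes></customer>
  </customers>
XML

listener = CustomerNameListener.new
REXML::Document.parse_stream(xml, listener)
p listener.names
```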

Upvotes: 2

dimitarvp

Reputation: 2383

Since your question is indeed "a bit vague", I can only recommend optimizing the XML import by using XML pull parsing.

Take a look at this: https://gist.github.com/827475

I needed to import MySQL XML, and to be fair, using the XML pull method sped up the parsing part by a factor of around 7 (yes, almost 7 times faster than reading the entire thing into memory).
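
For readers who have not used a pull parser, here is a minimal sketch with REXML's `PullParser`. It is not the code from the gist above; the `each_row` helper is hypothetical, and the `row`/`field` layout is a simplified assumption about what a MySQL XML export looks like. Rows are yielded one at a time, so the whole document never has to sit in memory as a tree.

```ruby
require "rexml/parsers/pullparser"

# Hypothetical helper: yields one Hash per <row>, built from its <field> children.
def each_row(xml)
  return enum_for(:each_row, xml) unless block_given?

  parser = REXML::Parsers::PullParser.new(xml)
  row = nil
  field = nil

  while parser.has_next?
    event = parser.pull
    if event.start_element?
      case event[0]
      when "row"
        row = {}
      when "field"
        field = event[1]["name"]          # attribute hash of the element
        row[field] = String.new if row && field
      end
    elsif event.text?
      # Only keep text that sits inside a named field of the current row
      row[field] << event[0] if row && field
    elsif event.end_element?
      case event[0]
      when "field"
        field = nil
      when "row"
        yield row
        row = nil
      end
    end
  end
end

# Sample export, invented for this example
xml = '<resultset>' \
      '<row><field name="id">1</field><field name="name">Ada</field></row>' \
      '<row><field name="id">2</field><field name="name">Bob</field></row>' \
      '</resultset>'

each_row(xml) { |row| p row }
```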

Another thing: you say "the DB import takes 4 hours". What file format are the DB exports you are importing?

Upvotes: 1
