user1136342
user1136342

Reputation: 4941

What's the fastest way to read a large file in Ruby?

I've seen answers to this question but I couldn't figure out which of the answers would perform the fastest. These are the answers I've seen- which is best?

  1. Read one line at a time using each or each_line
  2. Read one line at a time using gets
  3. Save it all into an array of lines using readlines and then use each
  4. Use grep (not sure what exactly to do with grep...)
  5. Use sed (not sure what exactly to do with sed...)
  6. Something else?

Also, would it be better to just use another language or should Ruby be fine?

EDIT:

More details: Each line contains something like "id1 attr1_1 attr2_1 id2 attr1_2 attr2_2... idn attr1_n attr2_n" (n is very big) and I need to insert those into a database. For that example line, I would need to insert n rows into the database.

Upvotes: 8

Views: 4693

Answers (2)

mdunsmuir
mdunsmuir

Reputation: 492

Ruby will likely be using the same or very similar low-level code (written in C) to do the actual reading from disk for the first three options, so they should perform similarly. Given that, you should choose whichever is most convenient for you; the ability to do that is what makes languages like Ruby so useful! You will be reading a lot of data from disk, so I would suggest using each_line and processing each line as you read it.

I would not recommend bringing grep, sed, or any other such external utilities into the picture unless you have a very good reason, as they will make your code less portable and expose you to failures that may be difficult to diagnose.

Upvotes: 5

deau
deau

Reputation: 1223

If you're using Ruby then there's no need to worry about performance. The language is such that it suits an iterative approach to reading a file, line by line, and works very nicely. So long as you're using the language the way it's designed you can let the interpreter people worry about performance. Job done.

If one particular readLargeFileFast method is needed then it should be because it's really hindering the program somehow. Now, you write a C program to do it and popen it as a separate process within your ruby code. You could call it read_large.c and (perhaps) use command line arguments to tell it how to behave.

This is championing the idea that a scripting language is used for a fast development rather than a fast run time. As such a developer can be very productive by swiftly 'prototyping' a program in something like Ruby and only later rewriting the components warrant some low level code. Often, however, once it's working in script, it's not necessary to do anything else at all.

The Ruby Docs describe launching a separate process and treating it as a file. It's easy-peasy! A good start is The Art of Linux Programming's introductory paragraph on program modularity. This book also makes a great example of using linux's standard stream editor, called sed, which you could probably use from Ruby right now.

If you need to parse or edit a lot of text then many interpreters or editors have been written around sed's functionality. Further, it may save you a lot of effort writing something super efficient if you don't know C. Good is the Introduction to SED by Bruce Barnett.

Upvotes: 3

Related Questions