est

Reputation: 577

How can I search a large sorted file in Perl?

Can you suggest any CPAN modules for searching a large sorted file?

The file contains structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.

Thanks.

Upvotes: 1

Views: 1333

Answers (5)

Randal Schwartz

Reputation: 44131

Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.

Upvotes: 3

brian d foy

Reputation: 132876

You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used in previous problems, all of which try to do work ahead of time:

  • Break up the file into a database. That could be SQLite, even.
  • Pre-index the file based on the data that you want to search.
  • Cache the results from previous searches.
  • Run common searches ahead of time, automatically.

All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
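To make the pre-indexing idea concrete, here is a minimal sketch using only core Perl. It assumes one record per line with the key in the first tab-separated field; the sample data and the names `%offset_of` and `fetch_record` are made up for illustration. The first pass records the byte offset of each record, and later lookups seek straight to it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: one record per line, key in the first tab-separated field.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "apple\t1\n", "banana\t2\n", "cherry\t3\n";
close $tmp or die "close: $!";

# Pass 1 (the "ahead of time" job): remember the byte offset where each record starts.
my %offset_of;
open my $in, '<', $path or die "open $path: $!";
while (1) {
    my $pos  = tell $in;              # offset of the line about to be read
    my $line = <$in>;
    last unless defined $line;
    my ($key) = split /\t/, $line, 2;
    $offset_of{$key} = $pos;
}

# Later lookups: seek straight to the record instead of scanning the file.
sub fetch_record {
    my ($fh, $index, $key) = @_;
    my $pos = $index->{$key};
    return unless defined $pos;       # key was never indexed
    seek $fh, $pos, 0 or die "seek: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}

print fetch_record($in, \%offset_of, 'banana'), "\n";
```

For a file with millions of keys, an in-memory hash like this would be about as heavy as loading the file itself, so in practice the index would more likely be written to disk (a DBM file, say) or hold only every Nth key.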

You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?

Upvotes: 3

jrockway

Reputation: 42684

A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25 million line file should require about 25 seeks (log2 of 25,000,000 is roughly 24.6). This means that to search for 25,000 records, you would only need around 0.6 million seeks and comparisons, compared to 25,000,000 line reads to naively examine each row.

Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
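As a rough illustration of the binary-search approach, here is a sketch that uses the core Search::Dict module rather than File::Sorted, whose interface is not shown in this thread. It assumes the file is sorted in plain string order and that the key is the first tab-separated field; the sample data and the helper name `find_record` are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Search::Dict;   # core module; provides look()

# Hypothetical sample data, sorted in plain string order as look() expects.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "alpha\t10\n", "bravo\t20\n", "charlie\t30\n", "delta\t40\n";
close $tmp or die "close: $!";

open my $fh, '<', $path or die "open $path: $!";

# Return the first line whose first tab-separated field equals $key, or undef.
sub find_record {
    my ($fh, $key) = @_;
    # look() binary-searches and leaves $fh at the first line that sorts ge $key.
    my $pos = look($fh, $key);
    return if $pos < 0;                 # -1 signals an I/O error
    my $line = <$fh>;
    return unless defined $line;        # key sorts after every line in the file
    chomp $line;
    my ($field) = split /\t/, $line, 2;
    return $field eq $key ? $line : undef;
}

print find_record($fh, 'charlie'), "\n";
```

Because look() only positions the filehandle, the caller still has to check that the line it lands on really carries the wanted key, which is what the final comparison does.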

Upvotes: 5

dave

Reputation: 11985

Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).

Some code:

while (my $line = <STDIN>) {
    if ($line =~ /regular expression/) {
        # process each matched line here, for example:
        print $line;
    }
}

You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once you match, you need your own code to process each matched line.

Put the above code in a script file and run it with your file redirected to stdin.
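If the 25,000 wanted entries can be identified by a key rather than a pattern, a variation on the same loop is to keep just those keys in a hash and stream the big file past it once. The sketch below assumes the key is the first tab-separated field; the sample data and the names `%wanted` and `@matches` are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: key in the first tab-separated field.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "alpha\t10\n", "bravo\t20\n", "charlie\t30\n", "delta\t40\n";
close $tmp or die "close: $!";

# Only the small set of wanted keys lives in memory (25,000 in the question; 2 here).
my %wanted = map { $_ => 1 } qw(bravo delta);

my @matches;
open my $in, '<', $path or die "open $path: $!";
while (my $line = <$in>) {              # reads one line per iteration
    my ($key) = split /\t/, $line, 2;
    next unless exists $wanted{$key};   # constant-time hash lookup per line
    chomp $line;
    push @matches, $line;
}
close $in;

print "$_\n" for @matches;
```

Memory use is bounded by the number of wanted keys plus the matches, not by the size of the file.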

Upvotes: 6

carillonator

Reputation: 4743

When you process an input file with while ( <$filehandle> ), it reads the file one line at a time (one line per iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop over <$filehandle>, which evaluates the read in list context and slurps the whole file into memory before the first iteration. Use a regex or whatever else to find what you're looking for, then put the result in a variable, array, or hash, or write it out to a new file.
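One small way to see the difference for yourself is to check eof inside the first iteration of each kind of loop. This is only a sketch with a throwaway three-line file; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway sample file with three lines.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "one\n", "two\n", "three\n";
close $tmp or die "close: $!";

open my $fh, '<', $path or die "open $path: $!";

# while: scalar context, one line is read per iteration.
my $eof_in_while;
while (my $line = <$fh>) {
    $eof_in_while = eof($fh) ? 1 : 0;   # 0: lines remain unread
    last;
}

seek $fh, 0, 0 or die "seek: $!";       # rewind for the second experiment

# for: list context, every line is read before the body first runs.
my $eof_in_for;
for my $line (<$fh>) {
    $eof_in_for = eof($fh) ? 1 : 0;     # 1: the whole file was already consumed
    last;
}
close $fh;

print "while: eof=$eof_in_while, for: eof=$eof_in_for\n";
```

In the for loop the filehandle is already at end of file during the first iteration, which is the slurping behaviour described above.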

Upvotes: 2
