Flypig

Reputation: 69

Which is the fastest way to print a specific line with the shell?

I have a file with 50 million lines and I have to pick 1000 random lines from it.

First, I create 1000 random numbers; then I use

sed -n "$random{p;q}" file  

It's really slow; printing one line takes at least 5-6 seconds.

So I think I should optimize the speed of printing a specific line.

There are many ways to print a specific line:

sed -n "$line{p;q}" file

awk "NR==$line{print}" file

head -$line file | tail -1

They are all slow; each takes about 5-6 seconds to print a specific line.

Are there any other ways in the shell to print a specific line? Could Python or Perl be faster than the shell? Or is my approach to this problem wrong?

EDIT:

Iterating over 1000 random numbers and running a shell command for each one means 1000 separate passes over the file. Maybe I should save the random numbers in an array first and iterate over the file only once:

random_lines=$(shuf -i 1-50000000 -n 1000)    # 1000 distinct random line numbers, space-separated

awk -v nums="$random_lines" 'BEGIN { split(nums, a); for (i in a) want[a[i]] } FNR in want' file

Well, I will test this approach and post the result later.

Upvotes: 2

Views: 523

Answers (4)

tripleee

Reputation: 189397

To avoid reading the entire file, you could fetch the file's size, then generate a list of 1000 offsets between 0 and that number. Those will usually be positions in the middle of a line, but you could read through to the next newline, then read and print the following line. However, this introduces a bias against the first line of the file. If you have a guesstimate for the average line length, you could subtract that number from the generated offsets (any negative result would mean reading and printing from offset 0).

Here is a quick proof of concept. For illustration purposes, I assumed an average line length of about 75 characters. This, too, affects the fairness (there's a higher probability that a line after a long line will be selected). The handling of the last line is also not fair; if it is shorter than 75 characters, it can never be selected (!) -- you can attempt to fix that by calculating the actual average line length from the lines you actually read, but I leave that as an exercise, in order to keep this example reasonably compact.

#!/usr/bin/perl

use strict;
use warnings;

use Fcntl (qw(SEEK_SET SEEK_CUR SEEK_END));

my $n = (@ARGV ? shift @ARGV : '--help');   # first argument: how many lines to print
die "Syntax: $0 number file\n" unless @ARGV == 1 and $n =~ m/^[0-9]+$/;

open (F, "<", $ARGV[0]) or die "$0: Could not open $ARGV[0]: $!\n";

seek (F, 0, SEEK_END) or die "$0: Could not SEEK_END $ARGV[0]: $!\n";
my $max = tell(F);

my %seen;
for (my $i=0; $i < $n; ++$i)
{
    my $offset = int(rand($max))-75;
    my $first = 0;
    if ($offset < 0)
    {
        $offset = 0;
        $first = 1;
    }
    seek (F, $offset, SEEK_SET)
        or die "$0: Could not SEEK_SET $ARGV[0]: $!\n";
    <F> unless $first;
    redo if eof (F);   # Cheap trick, just retry if at eof
    redo if $seen{tell(F)}++;
    print scalar(<F>);
}

I added code to avoid duplicates; this is the %seen hash.
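A hypothetical invocation (the file name randomlines.pl is just my placeholder for wherever you save the script), using the question's figure of 1000 lines:

perl randomlines.pl 1000 file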

Upvotes: 2

kevinsun

Reputation: 1

If you just want specific lines from a large data file, the cost grows with every request. If your file is immutable for a period (a week or longer), preprocessing becomes worthwhile. Here is one solution for your problem:

  1. Split the file into several smaller pieces, each with the same number of lines.
  2. Paste those pieces together side by side into a single file; after that, line 1 of the combined file will contain original lines 1, 1+n, 1+2n, and so on (where n is the number of lines per piece).
  3. A wrapper shell script to calculate which combined line and column to read will be necessary.

As you can see, this is just one method; a rough sketch follows below.
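A minimal sketch of that preprocessing, under assumptions not stated in the answer: GNU split/paste, a data file named file, pieces of 1,000,000 lines each, output names piece_* and combined, and no tab characters inside the lines (paste uses tabs as column separators):

# One-time preprocessing: split into pieces of 1,000,000 lines, then paste them side by side.
split -l 1000000 -d file piece_
paste piece_* > combined

# Original line L sits at row ((L-1) % 1000000) + 1, column ((L-1) / 1000000) + 1 of combined.
L=12345678
row=$(( (L - 1) % 1000000 + 1 ))
col=$(( (L - 1) / 1000000 + 1 ))
awk -F '\t' -v r="$row" -v c="$col" 'NR == r { print $c; exit }' combined

The point of the design is that combined has only 1,000,000 rows instead of 50,000,000, so each lookup scans a much shorter file.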

Upvotes: 0

Roman Cheplyaka

Reputation: 38718

Regardless of which tool you use, there is inherent cost in finding those lines. In essence, you need to traverse that large file each time, finding and counting the newline symbols.

There are two solutions I can see:

  1. Precompute the line offsets in the file in one pass, and then use lseek to find and print them. You can store only every 100th or 1000th line offset to save space (see the sketch after this list).

  2. Generate the whole list of line numbers upfront and gather the lines in one pass over the file. Then print them. (You can't print as you go if you want the order of the lines to be random).
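As the next paragraph notes, a robust version of option 1 is fiddly in plain shell, but a minimal sketch is possible with standard tools. Assumptions of mine, not the answer's: the data file is named file, the index goes to offsets.idx, every line offset is stored (not every 1000th), and the file uses single-byte characters with one-byte newlines:

# One pass: record the byte offset at which each line starts.
awk '{ print off; off += length($0) + 1 }' file > offsets.idx

# Later: print, say, line 1234567 without counting newlines in the big file again.
line=1234567
off=$(sed -n "${line}p" offsets.idx)
tail -c +"$((off + 1))" file | head -n 1

Storing only every 100th or 1000th offset, as the answer suggests, shrinks the index further at the cost of reading forward a few lines after each seek.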

Either of these would be hard to do in shell. For a shell-only solution, try devnull's suggestion, shuf. But instead of 1 you'd want to use 1000:

shuf -n 1000 file

Upvotes: 0

Vijay

Reputation: 67221

This prints the sampled lines in the order they appear in the file, without holding all of the lines in memory:

awk '
  NR==FNR { next }            # first pass: just let NR count the lines
  FNR==1 {                    # start of second pass: NR-1 is the total line count
    srand()
    n = NR - 1
    for (i = 1; i <= 1000; i++) {
      line = 0
      while (!line || line in A) line = int(rand() * n) + 1
      A[line]                 # remember 1000 distinct random line numbers
    }
  }
  FNR in A                    # second pass: print the selected lines
' infile infile

Upvotes: 1
