Reputation: 69
I have a file with 50 million lines, and I have to sample 1000 random lines from it.
First, I generate 1000 random numbers; then, for each number, I run
sed -n "$random{p;q}" file
It's really slow; printing a single line costs at least 5-6 seconds.
So I think I should optimize the speed of printing a specific line.
There are several ways to print a specific line:
sed -n "$line{p;q}" file
awk "NR==$line{print}" file
head -$line file | tail -1
All of them are slow, costing about 5-6 seconds to print a specific line.
Is there any other way in shell to print a specific line? Or can Python or Perl be faster than shell? Or is my approach to this problem wrong?
EDIT:
Iterating over 1000 random numbers and running a shell command for each one means 1000 passes over the file. Maybe I should save the random numbers in an array first and iterate over the file only once:
random_lines=$(shuf -i 1-50000000 -n 1000 | tr '\n' ' ')
awk -v nums="$random_lines" 'BEGIN { split(nums, a, " "); for (i in a) want[a[i]] } NR in want' file
I will test this approach and post the results later.
Upvotes: 2
Views: 523
Reputation: 189397
To avoid reading the entire file, you could fetch the file's size, then generate a list of 1000 offsets between 0 and that number. Those will usually be positions in the middle of a line, but you could read through to the next newline, then read and print the following line. However, this introduces a bias against the first line of the file. If you have a guesstimate for the average line length, you could subtract that number from the generated offsets (any negative outcome would mean to read and print from offset 0.)
Here is a quick proof of concept. For illustration purposes, I assumed an average line length of about 75 characters. This, too, affects the fairness (there's a higher probability that a line after a long line will be selected). The handling of the last line is also not fair; if it is shorter than 75 characters, it can never be selected (!) -- you could attempt to fix that by calculating the actual average line length from the lines you actually read, but I leave that as an exercise, in order to keep this example reasonably compact.
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_END);
my $n = (@ARGV ? shift @ARGV : '--help');
die "Syntax: $0 number file\n" unless @ARGV == 1 and $n =~ m/^[0-9]+$/;
open (F, "<", $ARGV[0]) or die "$0: Could not open $ARGV[0]: $!\n";
seek (F, 0, SEEK_END) or die "$0: Could not SEEK_END $ARGV[0]: $!\n";
my $max = tell(F);   # file size in bytes
my %seen;
for (my $i = 0; $i < $n; ++$i)
{
    # Back up 75 bytes (the guesstimated average line length) so that
    # the line containing the random offset itself can be selected
    my $offset = int(rand($max)) - 75;
    my $first = 0;
    if ($offset < 0)
    {
        $offset = 0;
        $first = 1;
    }
    seek (F, $offset, SEEK_SET)
        or die "$0: Could not SEEK_SET $ARGV[0]: $!\n";
    <F> unless $first;          # skip the (usually partial) line we landed in
    redo if eof (F);            # Cheap trick, just retry if at eof
    redo if $seen{tell(F)}++;   # retry if this line was already selected
    print scalar(<F>);
}
I added code to avoid duplicates; this is the %seen hash.
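To run it, assuming the script above is saved as random-lines.pl (a name chosen here just for illustration):
perl random-lines.pl 1000 file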
Upvotes: 2
Reputation: 1
If you just want specific lines from a large data file, the cost of scanning it recurs with every request. If your file is immutable during a period (a week or longer), pretreatment pays off: process the file once up front so that later requests are cheap. One possible pretreatment is sketched below; it is just one method among others.
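For illustration, a minimal sketch of such a pretreatment, assuming GNU tools and a single-byte encoding (the index file name file.idx is arbitrary): store each line's start offset as a fixed-width record, so any line can later be fetched with two seeks instead of a full scan.
# one-time pretreatment: write each line's start offset as 12 digits + newline
LC_ALL=C awk '{ printf "%012d\n", off; off += length($0) + 1 }' file > file.idx
# fetch line $n later: read its offset from the fixed-width index, then seek
off=$(dd if=file.idx bs=13 skip=$((n - 1)) count=1 2>/dev/null)
tail -c +$((10#$off + 1)) file | head -n 1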
Upvotes: 0
Reputation: 38718
Regardless of which tool you use, there is inherent cost in finding those lines. In essence, you need to traverse that large file each time, finding and counting the newline symbols.
There are two solutions I can see:
Precompute the line offsets in the file in one pass, and then use lseek to find and print them. You can store every 100th or 1000th line offset to save space.
Generate the whole list of line numbers upfront and gather the lines in one pass over the file, then print them; see the sketch at the end of this answer. (You can't print as you go if you want the order of the lines to be random.)
Either of these would be hard to do in shell. For a shell-only solution, try devnull's suggestion, shuf. But instead of 1 you'd want to use 1000:
shuf -n 1000 file
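As a rough sketch of the second approach (one pass to gather the lines, then print them in the random order), assuming the line count of 50 million is known and the temporary file name wanted is arbitrary:
# 1000 distinct random line numbers, already in random order
shuf -i 1-50000000 -n 1000 > wanted
# one pass over the big file to collect those lines, then print them
# in the shuffled order recorded in "wanted"
awk 'NR==FNR { order[++m] = $1; want[$1]; next }
     FNR in want { line[FNR] = $0 }
     END { for (i = 1; i <= m; i++) print line[order[i]] }' wanted file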
Upvotes: 0
Reputation: 67221
This prints the sampled lines in the order they appear in the file, without holding all lines in memory. The file is read twice: the first pass only counts the lines, the second prints the selected ones:
awk '
    NR==FNR { next }        # first pass: only advance NR to count lines
    FNR==1 {                # start of second pass: NR-1 is the line count
        srand()
        n = NR - 1
        for (i = 1; i <= 1000; i++) {
            line = 0
            while (!line || line in A)
                line = int(rand() * n) + 1   # retry until an unused number
            A[line]                          # mark this line number as wanted
        }
    }
    FNR in A                # second pass: print the selected lines
' infile infile
Upvotes: 1