Reputation: 93
Currently I am using sed to print the required portion of the file. For example, I used the command below:
sed -n 89001,89009p file.xyz
However, it is pretty slow as the file size increases (my file is currently 6.8 GB). I have tried to follow this link and used the command
sed -n '89001,89009{p;q}' file.xyz
But this command only prints the 89001st line. Kindly help me.
Upvotes: 6
Views: 2136
Reputation: 437218
Dawid Grabowski's helpful answer is the way to go (with sed [1]; Ed Morton's helpful answer is a viable awk alternative; a tail + head combination will typically be the fastest [2]).
As for why your approach didn't work:
A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end address (line numbers, in this case).
The associated function list, {p;q;}, is then executed for each line in the selected range.
Thus, line # 89001 is the 1st line that causes the function list to be executed: right after printing (p) the line, function q is executed - which quits execution right away, without processing any further lines.
To prevent premature quitting, Dawid's answer therefore separates the aspect of printing (p) all lines in the range from quitting (q) processing, using two commands separated with ; (a small demonstration of the difference appears at the end of this answer):
- 89001,89009p prints all lines in the range
- 89009q quits processing when the range's end point is reached

[1] A slightly less repetitive reformulation that should perform equally well ($ represents the last line, which is never reached due to the 2nd command):
sed -n '89001,$ p; 89009 q'
[2] A better reformulation of the head + tail solution from Dawid's answer is tail -n +89001 file | head -n 9, because it caps the number of bytes that are not of interest yet are still sent through the pipe at the pipe-buffer size (a typical pipe-buffer size is 64 KB).
With GNU utilities (Linux), this is the fastest solution, but on OSX with stock utilities (BSD), the sed solution is fastest.
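To see the behavior described above on a small scale, here is a quick demonstration (my own illustration, assuming seq is available to generate a throwaway test file; the range is scaled down to lines 5 through 9):
seq 20 > demo.txt
# Prints only line 5: q quits on the very first line of the range.
sed -n '5,9{p;q}' demo.txt
# Prints lines 5 through 9, then quits at line 9.
sed -n '5,9p;9q' demo.txt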
Upvotes: 3
Reputation: 1641
Another way to do it is to use a combination of head and tail:
$ time head -890010 large-file | tail -10 > /dev/null
real 0m0.085s
user 0m0.024s
sys 0m0.016s
This is faster than sed and awk.
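Applied to the range asked about in the question (lines 89001 through 89009), the same pattern would be the following (a sketch of mine, not benchmarked here):
head -n 89009 file.xyz | tail -n 9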
Upvotes: 0
Reputation: 1641
The syntax is a little bit different:
sed -n '89001,89009p;89009q' file.xyz
UPDATE:
Since there is also an answer with awk, I made a small comparison and, as I thought, sed is a little bit faster:
$ wc -l large-file
100000000 large-file
$ du -h large-file
954M large-file
$ time sed -n '890000,890010p;890010q' large-file > /dev/null
real 0m0.141s
user 0m0.068s
sys 0m0.000s
$ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null
real 0m0.433s
user 0m0.208s
sys 0m0.008s
UPDATE2:
There is a faster way with awk, as posted by @EdMorton, but it is still not as fast as sed:
$ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null
real 0m0.252s
user 0m0.172s
sys 0m0.008s
UPDATE3:
This is the fastest way I was able to find (head and tail):
$ time head -890010 large-file | tail -10 > /dev/null
real 0m0.085s
user 0m0.024s
sys 0m0.016s
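If you want to reproduce the comparison, a test file of roughly this shape (100 million short lines) can be generated with something like the following (an assumption on my part; the answer does not show how large-file was created):
# 100 million short numeric lines; the size is on the same order as shown above.
seq 100000000 > large-file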
Upvotes: 8
Reputation: 15
sed has to search from the beginning of the file to find the N'th line. To make things faster, divide the large file into intervals of a fixed number of lines using an index file. Then use dd to skip the early portion of the big file before feeding it to sed.
Build the index file using:
#!/bin/bash
# Build an index of byte offsets: one entry per INTERVAL lines of the large file.
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"
LASTSTONE=123          # any value different from MILESTONE, so the loop runs at least once
MILESTONE=0
echo $MILESTONE > $INDEX_FILE
# Stop once an iteration adds no more bytes (end of file reached).
while [ $MILESTONE != $LASTSTONE ]; do
    LASTSTONE=$MILESTONE
    # Byte count of the next INTERVAL lines, starting at the previous offset ...
    MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null | head -n$INTERVAL | wc -c)
    # ... added to the previous offset gives the next milestone.
    MILESTONE=$(($LASTSTONE+$MILESTONE))
    echo $MILESTONE >> $INDEX_FILE
done
exit
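Assuming the script above is saved as build_index.sh (a name of my choosing; the answer does not name it), it only needs to be run once per large file:
chmod +x build_index.sh
./build_index.sh    # writes one byte offset per INTERVAL lines into the "index" file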
Then search for a line using: ./this_script.sh 89001
#!/bin/bash
# Print line number $1 of the large file, using the index to skip ahead with dd
# instead of scanning the whole file from the start.
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"
LN=$(($1-1))                     # requested line, 0-based
# Byte offset of the start of the interval that contains the requested line.
OFFSET=$(head -n$((1+($LN/$INTERVAL))) $INDEX_FILE | tail -n1)
# Position of the requested line within that interval, converted back to 1-based.
LN=$(($LN-(($LN/$INTERVAL)*$INTERVAL)))
LN=$(($LN+1))
# Jump straight to the interval's offset, then let sed pick out the line.
dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN"p
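Since the question asks for a range of lines rather than a single one, the last line of the script could be adapted roughly as follows (my own sketch, not part of the original answer; COUNT is a hypothetical second argument, e.g. ./this_script.sh 89001 9):
# Print COUNT lines starting at the requested line, then stop.
COUNT=${2:-1}   # hypothetical second argument: how many lines to print
END=$(($LN+$COUNT-1))
dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN,${END}p;${END}q"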
Upvotes: -2
Reputation: 67467
Easier to read in awk; performance should be similar to sed:
awk 'NR>=89001{print} NR==89009{exit}' file.xyz
You can replace {print} with a semicolon as well.
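That shorter form would look like this (a sketch; the bare pattern relies on awk's default action of printing the line):
awk 'NR>=89001; NR==89009{exit}' file.xyz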
Upvotes: 2