Fastest way to print a certain portion of a file using bash commands

Currently I am using sed to print the required portion of the file. For example, I used the command below:

sed -n 89001,89009p file.xyz

However, it is getting pretty slow as the file size increases (my file is currently 6.8 GB). I tried to follow this link and used the command

sed -n '89001,89009{p;q}' file.xyz

But this command only prints the 89001st line. Kindly help me.

Upvotes: 6

Views: 2136

Answers (6)

mklement0

Reputation: 437218

Dawid Grabowski's helpful answer is the way to go (with sed[1]; Ed Morton's helpful answer is a viable awk alternative; a tail+head combination will typically be the fastest[2]).

As for why your approach didn't work:

A two-address expression such as 89001,89009 selects an inclusive range of lines, bounded by the start and end address (line numbers, in this case).

The associated function list, {p;q}, is then executed for each line in the selected range.

Thus, line 89001 is the first line that causes the function list to be executed: right after printing (p) the line, function q is executed, which quits execution right away, without processing any further lines.

To prevent premature quitting, Dawid's answer therefore separates printing (p) all lines in the range from quitting (q) processing, using two commands separated by ; (see the small demo after the list below):

  • 89001,89009p prints all lines in the range
  • 89009q quits processing when the range's end point is reached
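
For a quick demonstration on a toy file (a sketch; seq generates ten numbered lines, and ten.txt is a made-up name):

$ seq 10 > ten.txt
$ sed -n '4,6{p;q}' ten.txt    # q runs on the first line in the range, so only line 4 prints
4
$ sed -n '4,6p; 6q' ten.txt    # prints lines 4 through 6, then quits
4
5
6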

[1] A slightly less repetitive reformulation that should perform equally well ($ represents the last line, which is never reached due to the 2nd command):
sed -n '89001,$ p; 89009 q'

[2] A better formulation of the head + tail solution from Dawid's answer is
tail -n +89001 file | head -n 9, because it caps the number of unneeded bytes that are still sent through the pipe at the pipe-buffer size (a typical pipe-buffer size is 64 KB).
With GNU utilities (Linux), this is the fastest solution, but on OSX with stock utilities (BSD), the sed solution is fastest.
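
Spelled out as a runnable pipeline (using file.xyz from the question; the comments are mine):

$ tail -n +89001 file.xyz | head -n 9   # tail -n +K outputs starting at line K
# head exits after 9 lines; tail then dies of SIGPIPE, so little excess data flows.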

Upvotes: 3

dbosky

Reputation: 1641

Another way to do it is to use a combination of head and tail (head -890010 is the old-style shorthand for head -n 890010):

$ time head -890010 large-file | tail -10 > /dev/null

real    0m0.085s
user    0m0.024s
sys     0m0.016s

This is faster than sed and awk.

Upvotes: 0

dbosky

Reputation: 1641

The syntax is a little bit different:

sed -n '89001,89009p;89009q' file.xyz
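
If you need this often, the same idea can be wrapped in a small shell function (a sketch; print_range is a hypothetical name, not part of this answer):

print_range() {   # usage: print_range START END FILE
    # print lines START..END, then quit as soon as END is reached
    sed -n "${1},${2}p; ${2}q" "$3"
}

print_range 89001 89009 file.xyz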

UPDATE:

Since there is also an answer with awk, I made a small comparison, and as I thought, sed is a little bit faster:

$ wc -l large-file 
100000000 large-file
$ du -h large-file 
954M    large-file
$ time sed -n '890000,890010p;890010q' large-file > /dev/null

real    0m0.141s
user    0m0.068s
sys     0m0.000s
$ time awk 'NR>=890000{print} NR==890010{exit}' large-file > /dev/null

real    0m0.433s
user    0m0.208s
sys     0m0.008s

UPDATE2:

There is a faster way with awk, as posted by @EdMorton, but it is still not as fast as sed:

$ time awk 'NR>=890000{print; if (NR==890010) exit}' large-file > /dev/null

real    0m0.252s
user    0m0.172s
sys     0m0.008s

UPDATE3:

This is the fastest way I was able to find (head and tail):

$ time head -890010 large-file | tail -10 > /dev/null

real    0m0.085s
user    0m0.024s
sys     0m0.016s
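
To reproduce these timings, a test file of numbered lines can be generated with seq (an assumption on my part; the answer does not say how large-file was created, so the sizes will only roughly match):

$ seq 100000000 > large-file   # 100 million lines, roughly 0.9 GB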

Upvotes: 8

ronybc

Reputation: 15

sed has to search from the beginning of the file to find the Nth line, which is what makes it slow here. To make things faster, divide the large file at fixed line-number intervals using an index file. Then use dd to skip the early portion of the big file before feeding it to sed.

Build the index file using:

#!/bin/bash

# Record the byte offset of every INTERVAL-th line of LARGE_FILE in INDEX_FILE.
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"

LASTSTONE=123   # sentinel: any value different from MILESTONE
MILESTONE=0

echo $MILESTONE > "$INDEX_FILE"

# Stop once a pass adds no bytes, i.e. the end of the file has been reached.
while [ $MILESTONE != $LASTSTONE ]; do
    LASTSTONE=$MILESTONE
    # Byte count of the next INTERVAL lines, starting at the current offset.
    MILESTONE=$(dd if="$LARGE_FILE" bs=1 skip=$LASTSTONE 2>/dev/null | head -n$INTERVAL | wc -c)
    MILESTONE=$((LASTSTONE + MILESTONE))
    echo $MILESTONE >> "$INDEX_FILE"
done

exit

Then search for a line using: ./this_script.sh 89001

#!/bin/bash

# Print line number $1 of LARGE_FILE, using the byte offsets stored in INDEX_FILE.
INTERVAL=1000
LARGE_FILE="big-many-GB-file"
INDEX_FILE="index"

LN=$(($1 - 1))

# Byte offset of the interval that contains the requested line
# (the index holds one entry per INTERVAL lines).
OFFSET=$(head -n$((1 + (LN / INTERVAL))) "$INDEX_FILE" | tail -n1)
# 1-based position of the requested line within that interval.
LN=$((LN - ((LN / INTERVAL) * INTERVAL)))
LN=$((LN + 1))
dd if="$LARGE_FILE" bs=1 skip=$OFFSET 2>/dev/null | sed -n "$LN"p
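
A quick sanity check on a small stand-in file (a sketch; build_index.sh is a hypothetical name for the first script above, since only the second one is named):

$ seq 5000 > big-many-GB-file   # small stand-in for the real multi-GB file
$ ./build_index.sh              # writes the byte offsets of lines 1, 1001, 2001, ... to "index"
$ ./this_script.sh 3456         # seeks to the offset of line 3001, prints the 456th line from there
3456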

Upvotes: -2

Ed Morton

Reputation: 203229

awk 'NR>=89001{print; if (NR==89009) exit}' file.xyz

Upvotes: 4

karakfa

Reputation: 67467

Easier to read in awk; performance should be similar to sed:

awk 'NR>=89001{print} NR==89009{exit}' file.xyz

You can replace {print} with a semicolon as well, since printing the line is awk's default action.
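
That is, the shorter equivalent would be:

awk 'NR>=89001; NR==89009{exit}' file.xyz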

Upvotes: 2
