user31641

Reputation: 175

Extract specified lines from a file

I have a file and I want to extract specific lines from it, like lines 2, 10, 15, 21, ... and so on. There are around 200 thousand lines to be extracted from the file. How can I do it efficiently in bash?

Upvotes: 2

Views: 180

Answers (6)

Dmitry Alexandrov

Reputation: 1773

$ gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines

The wanted line numbers have to be stored in lines, one per line, and they may safely be in random order. This is almost exactly the same as @Mark Setchell's second method, but uses a slightly clearer way to determine which file is current. ARGIND is a GNU extension, though, hence gawk. If you are limited to original AWK or mawk, you can write it as:

$ awk 'FILENAME==ARGV[1] { L[$0]++ }; FILENAME==ARGV[2] && FNR in L' lines file > file.lines

Efficiency test:

$ awk 'BEGIN { for (i=1; i<=1000000; i++) print i }' > file
$ shuf -i 1-1000000 -n 200000 > lines
$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines

real    0m1.734s
user    0m1.460s
sys     0m0.052s

UPD:

As @Costi Ciudatu pointed out, there is room for improvement for the case when all the wanted lines are near the head of the file.

#!/usr/bin/gawk -f

# First file: remember each wanted line number.
ARGIND==1 { L[$0]++ }
# ENDFILE runs after each input file; after the first, FNR is the count of wanted numbers.
ENDFILE { L_COUNT = FNR }

# Second file: print wanted lines, counting how many have been printed.
ARGIND==2 && FNR in L { L_PRINTED++; print }
# Exit as soon as the last wanted line has been printed.
ARGIND==2 && L_PRINTED == L_COUNT { exit 0 }

The script exits as soon as the last wanted line has been printed, so it now takes a few milliseconds to filter out 2000 random lines from the first 1 % of a one-million-line file.
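
To run it as below, save it as getlines.awk and make it executable (assuming the gawk path in the shebang matches your system):

$ chmod +x getlines.awk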

$ time ./getlines.awk lines file > file.lines

real    0m0.016s
user    0m0.012s
sys     0m0.000s

Reading the whole file, by contrast, still takes about a second:

$ time gawk 'ARGIND==1 { L[$0]++ }; ARGIND==2 && FNR in L' lines file > file.lines

real    0m0.780s
user    0m0.756s
sys     0m0.016s

Upvotes: 1

Costi Ciudatu

Reputation: 38195

If the lines you're interested in are close to the beginning of the file, you can make use of head and tail to efficiently extract specific lines.

For your example line numbers (assuming that list doesn't go on until close to 200,000), a naive but still efficient approach to read those lines would be the following:

for n in 2 10 15 21; do
    head -n $n /your/large/file | tail -1
done
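
If the numbers live in a file (one per line), the same loop can read them from there; a sketch assuming that file is called lines:

while read -r n; do
    head -n "$n" /your/large/file | tail -1
done < lines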

Upvotes: 0

tripleee

Reputation: 189387

Provided your system supports sed -f - (i.e. sed reading its script from standard input; this works on Linux, but not on some other platforms), you can turn the file of line numbers into a sed script, naturally using sed:

sed 's/$/p/' lines | sed -n -f - inputfile >output
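
Each number just gains a trailing p, the sed print command; with the question's example numbers in lines, the generated script looks like this:

$ sed 's/$/p/' lines
2p
10p
15p
21p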

Upvotes: 0

Mark Setchell

Reputation: 207465

Put the linenumbers of the lines you want in a file called "wanted", like this:

2
10
15
21

Then run this script:

#!/bin/bash
while read w
do
   sed -n "${w}p" yourfile
done < wanted

TOTALLY ALTERNATIVE METHOD

Or you could let "awk" do it all for you, like this which is probably miles faster since you won't have to create 200,000 sed processes:

awk 'FNR==NR{a[$1]=1;next}{if(FNR in a){print;}}' wanted yourfile

The FNR==NR portion detects when awk is reading the file called "wanted"; if so, it sets element $1 of array a to 1, so we know that this line number is wanted. The stuff in the second set of curly braces is active only when processing the bigger file, and it prints the current line if its line number is in the array a built while reading the "wanted" file.
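
Spread over several lines with comments, the same one-liner reads like this (purely a readability sketch; the behaviour is identical):

awk '
    FNR==NR { a[$1]=1; next }   # still reading "wanted": mark this line number
    FNR in a                    # reading "yourfile": print lines whose number is marked
' wanted yourfile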

Upvotes: 1

foxli

Reputation: 11

Maybe you're looking for: sed -n -e 1p -e 4p afile
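
For the question's line numbers that would be (fine for a handful of lines, though not for 200,000 of them):

sed -n -e 2p -e 10p -e 15p -e 21p file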

Upvotes: 1

Ashish

Reputation: 1952

sed Example

sed -n '2p' file

awk Example

awk 'NR==2' file

This will print the 2nd line of the file.

Use the same logic in a loop, say a for loop:

for VARIABLE in 2 10 15 21
do
    awk "NR==$VARIABLE" file
done

Give your line numbers this way.

Upvotes: -2
