mike

Reputation: 3519

bash pull certain lines from a file

I was wondering if there is a more efficient way to get this task done. I am working with files with the number of lines ranging from a couple hundred thousand to a couple million. Say I know that lines 100,000 - 125,000 are the lines that contain the data I am looking for. I would like to know if there is a quick way to pull just these desired lines from the file. Right now I am using a loop with grep like this:

 for ((i=$start_fid; i<=$end_fid; i++))
  do
    grep "^$i " fulldbdir_new >> new_dbdir${bscnt}
  done

Which works fine; it's just taking longer than I would like. And the lines contain more than just numbers. Basically each line has about 10 fields, with the first being a sequential integer that appears only once per file.

I am comfortable writing in C if necessary.
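For comparison, since the first field is a unique sequential integer, the whole loop above can be collapsed into one pass over the file. A sketch using the snippet's variables (the `seq`/`awk` demo data here is a stand-in for the real `fulldbdir_new`):

```shell
# Demo stand-ins for the question's data and variables: field 1 is a
# unique sequential integer, as described.
seq 1 1000 | awk '{print $1, "field2", "field3"}' > fulldbdir_new
start_fid=100
end_fid=125
bscnt=1

# One awk pass over the file instead of one grep per id; awk stops
# reading as soon as it passes end_fid.
awk -v s="$start_fid" -v e="$end_fid" '$1 > e {exit} $1 >= s' \
    fulldbdir_new > "new_dbdir${bscnt}"
```

This reads the file once instead of once per id, and the `exit` means the tail of the file is never scanned at all.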

Upvotes: 18

Views: 9725

Answers (4)

Ole Tange

Reputation: 33695

The answers so far read the first 100000 lines and discard them. As disk I/O is often the limiting factor these days, it might be nice to have a solution that does not have to read the unwanted lines at all.

If the first 100000 lines are all roughly the same length, you can compute how far to seek into the file to land at approximately line 100000 and then read the next 25000 lines. Read a bit extra before and after to make sure you have all 25000.

You would not know exactly what line you were at, though, which may or may not be important for you.

Assume the average line length of the first 100000 lines is 130 bytes; then you can seek 100000 blocks of 130 bytes into the file and read the next 25000 lines:

 dd if=the_file bs=130 skip=100000 | head -n 25000

You would have to throw away the first line, as it is likely to be only half a line.
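Because the first field is a sequential integer, the "approximately" can be fixed up exactly: seek a little early, drop the possibly-partial first line, and let the field-1 values trim the range. A sketch (the fixed-width demo file makes the arithmetic exact here; on a real file the seek point is only approximate, which is why it starts at ~line 98000 rather than 100000):

```shell
# Demo file: 200000 lines, each padded to exactly 130 bytes so the
# seek arithmetic below lands cleanly; real files are only approximate.
awk 'BEGIN { for (i = 1; i <= 200000; i++) printf "%-129d\n", i }' > the_file

# Seek to roughly line 98000 (a little before 100000 to be safe),
# drop the possibly-partial first line, then let the sequential
# first field trim the range exactly.
dd if=the_file bs=130 skip=98000 2>/dev/null \
    | tail -n +2 \
    | awk '$1 > 125000 {exit} $1 >= 100000' > wanted_lines
```

The `awk` filter both discards the lines before 100000 that the early seek let through and stops reading at 125000, so the exact line numbers never need to be known in advance.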

Upvotes: 1

mhyfritz

Reputation: 8522

I'd use awk:

awk 'NR >= 100000; NR == 125000 {exit}' file

For big numbers you can also use E notation:

awk 'NR >= 1e5; NR == 1.25e5 {exit}' file

EDIT: @glenn jackman's suggestion (cf. comment)

Upvotes: 6

Costa

Reputation: 2043

sed can do the job...

sed -n '100000,125000p' input

EDIT: As per glenn jackman's suggestion, it can be made to quit at line 125001 so the rest of the file is never read:

sed -n '100000,125000p; 125001q' input

Upvotes: 24

gpojd

Reputation: 23075

You can try a combination of head and tail to get the correct lines: head -n 125000 prints the first 125000 lines, and tail -n 25001 keeps the last 25001 of those, i.e. lines 100000 through 125000 inclusive.

head -n 125000 file_name | tail -n 25001 | grep "^$i "

Don't forget perl either.

perl -ne 'print if $. >= 100000 && $. <= 125000' file_name | grep "^$i "

or some faster perl that quits once past the range (note the exit runs after printing line 125000):

perl -ne 'print if $. >= 100000; exit if $. >= 125000' file_name | grep "^$i "

Also, instead of a for loop you might want to look into using GNU parallel.
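GNU parallel may not be installed everywhere, so here is the same idea sketched with xargs -P, which is widely available: several greps run at once instead of a serial loop. (Demo data stands in for the question's fulldbdir_new; concurrent output order is not guaranteed, hence the sort.)

```shell
# Demo stand-in for fulldbdir_new: unique sequential id in field 1.
seq 1 500 | awk '{print $1, "payload"}' > fulldbdir_new

# Run up to 4 greps concurrently, one per id in the range, then
# restore numeric order since the greps finish in any order.
seq 100 125 \
    | xargs -P 4 -I{} grep "^{} " fulldbdir_new \
    | sort -n > new_dbdir1
```

This still scans the whole file once per id, so the single-pass awk/sed/perl answers remain the better fit here; the parallel form pays off mainly when each per-id job does real work beyond a grep.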

Upvotes: 2
