Zeus

Reputation: 1525

Searching a file non-sequentially

Usually when I search a file with grep, the search is done sequentially. Is it possible to perform a non-sequential search or a parallel search? For example, can I search between line l1 and line l2 without having to go through the first l1-1 lines?

Upvotes: 0

Views: 98

Answers (3)

Andrea Ratto

Reputation: 835

If your lines are fixed length, you can use dd to read a particular section of the file:

dd if=myfile.txt bs=<line_length> count=<lines_to_read> skip=<start_line> | other_commands

Note that dd will read from disk using the block size specified for input (bs). Reading one line per block can be slow, so you can batch the reads by pulling a group of lines at once so that each read fetches at least 4 KB from disk. In that case you want to look at the skip_bytes and count_bytes flags, which let you start and end at byte offsets that are not multiples of your block size. Another interesting option is the output block size obs, which could usefully be either the same as the input block size or a single line.
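For example, here is a rough sketch assuming GNU dd, a hypothetical myfile.txt with fixed 80-byte lines, and that you want lines 1001 through 2000:

# skip_bytes/count_bytes make skip= and count= byte counts,
# so bs can stay at a disk-friendly 4k regardless of line length
dd if=myfile.txt bs=4k iflag=skip_bytes,count_bytes skip=$((1000 * 80)) count=$((1000 * 80)) 2>/dev/null | other_commands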

Upvotes: 1

b4hand

Reputation: 9770

You can use tail -n +N file | grep to begin a grep at a given line offset.

You can combine head with tail to search over just a fixed range.
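For example, a minimal sketch that searches only lines 1000 through 2000 of a hypothetical file.txt (head -n 1001 keeps 2000 - 1000 + 1 lines):

tail -n +1000 file.txt | head -n 1001 | grep pattern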

However, this still must scan the file for end of line characters.

In general, sequential reads are the fastest reads for disks. Trying to do a parallel search will most likely cause random disk seeks and perform worse.

For what it is worth, a typical book contains about 200 words per page. At a typical 5 letters per word, you're looking at about 1 KB per page, so 1000 pages would still be only about 1 MB. A standard desktop hard drive can easily read that in a fraction of a second.

You can't speed up disk read throughput this way. In fact, I can almost guarantee you are not saturating your disk read rate right now for a file that small. You can use iostat to confirm.
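For example, assuming the sysstat package is installed, you can watch extended per-device statistics once a second while your grep runs:

iostat -x 1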

If your file is completely ASCII, you may be able to speed things up by setting your locale to the C locale to avoid doing any kind of Unicode translation.
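For example (with myfile.txt standing in for your file):

LC_ALL=C grep pattern myfile.txt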

If you need to do multiple searches over the same file, it would be worthwhile to build a reverse index to do the search. For code, there are tools like Exuberant Ctags that can do that for you. Otherwise, you're probably looking at building a custom tool. There are tools for doing general text search over large corpora, but that's probably overkill for you. You could even load the file into a database like PostgreSQL that supports full text search and have it build an index for you.
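For example, with Exuberant Ctags a single command builds a tags index for a source tree (this is just the indexing step; your editor or a lookup tool then consults the generated tags file):

ctags -R .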

Padding the lines to a fixed record length is not necessarily going to solve your problem. As I mentioned before, I don't think you have an IO throughput issue, and you could see that yourself by simply moving the file to a temporary ram disk that you create. That removes all potential IO. If that's still not fast enough for you, then you're going to have to pursue an entirely different solution.
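For example, on a Linux box where /dev/shm is already a tmpfs mount, a quick sketch of that test would be:

cp myfile.txt /dev/shm/
time grep pattern /dev/shm/myfile.txt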

Upvotes: 1

Gerard van Helden

Reputation: 1602

The simple answer is: you can't. What you want contradicts itself: You don't want to scan the entire file, but you want to know where each line ends. You can't know where each line ends without actually scanning the file. QED ;)

Upvotes: 0
