Reputation: 13
I'm working on a problem where I have to replace a certain range of lines in a huge text file with data from another (smaller, but still large) text file.
Suppose file1 has 10,000 lines and file2 has 3,000 lines. I need to perform operations of the type: extract lines 901-970 from file2 and insert them into lines 8701-8770 of file1, replacing what was there before. In the problem I'm actually working on, file1 has 61 million lines and file2 has 18 million. I need this operation to be efficient because it is performed many times; in the end, the whole content of file2 will end up somewhere inside file1.
The best solution I've got so far consists of splitting the two files into small files, each with the number of lines of the block being copied and replaced (70 in the example above). This proved much more efficient than a head and tail combination for extracting parts of the files, but it still requires touching parts of file1 that are not modified.
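For the example numbers above, the head/tail combination I mean would look something like this (just a sketch; file1.new is a temporary name I'm using for illustration):

    # replace lines 8701-8770 of file1 with lines 901-970 of file2
    head -n 8700 file1        >  file1.new    # everything before the block
    sed -n '901,970p' file2   >> file1.new    # the 70 replacement lines from file2
    tail -n +8771 file1       >> file1.new    # everything after the block
    mv file1.new file1

Every variant along these lines rewrites all of file1 just to change 70 lines.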
I was wondering if there is an awk/grep/sed solution to this. Extracting a part of file2 is not the problem, but I couldn't figure out how to replace a block of lines in file1 without loading the entire file.
Thanks!
Upvotes: 0
Views: 151
Reputation: 13
Following Jeff Y's suggestion, I used the dd command to do the replacements efficiently at the byte level. I first extract a block from file2 using:
dd if="file2" bs="$bperelem" skip="$start_copy" count=1 of="tmp2" 2> /dev/null
where bperelem is the size of the block in bytes and start_copy is the block's offset within file2, measured in units of bperelem (dd's skip= counts blocks of bs bytes, not individual bytes). Then I write this block into file1 using the following:
dd if="tmp2" bs="$bperelem" skip=0 count=1 seek="$start_replace" of="file1" conv=notrunc 2> /dev/null
For my specific problem, the variables start_copy and start_replace are updated inside a while loop.
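Put together, the loop looks roughly like this. This is only a sketch: it assumes fixed-width lines, so that every block really is exactly $bperelem bytes, and the variable num_blocks and the way the two offsets advance here are made up for illustration (my real update rule is different):

    # sketch: copy file2 into file1 one block at a time
    # assumes fixed-width lines, so each block is exactly $bperelem bytes
    start_copy=0          # block index inside file2
    start_replace=100     # block index inside file1 (made-up starting point)
    while [ "$start_copy" -lt "$num_blocks" ]; do
        # extract one block from file2 ...
        dd if="file2" bs="$bperelem" skip="$start_copy" count=1 of="tmp2" 2> /dev/null
        # ... and overwrite the corresponding block of file1 in place
        dd if="tmp2" bs="$bperelem" skip=0 count=1 seek="$start_replace" of="file1" conv=notrunc 2> /dev/null
        start_copy=$((start_copy + 1))
        start_replace=$((start_replace + 1))
    done

Note that seek= (like skip=) is measured in units of bs, so start_replace is a block index rather than a byte offset, and conv=notrunc is what stops dd from truncating file1 right after the block it writes.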
Upvotes: 1
Reputation: 2456
The problem is that you'd have to do a random-access type of operation (as distinct from sequential processing) to "avoid touching" the parts of file1 that don't change, and random access in files works at the character/byte level, not the line level. That is, if the number of bytes (as opposed to lines) being replaced in file1 were the same as the number of bytes coming from file2, you could do it (with fseek and the like). But it sounds like that is in no way guaranteed?
So you're going to have to do a single pass over file1 regardless, and the key will be optimizing the processing inside the loop over file1's lines. Consider processing all the pieces of file2 with one pass over file1, rather than running a separate operation on both files for every block?
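For example, here is a rough sketch of that idea. It assumes you can express all the replacements as a small spec file (I'll call it ranges.txt, one "dst_start dst_end src_start" triple per line, meaning: replace file1 lines dst_start..dst_end with file2 lines starting at src_start) and that file2 fits in memory:

    awk '
        FILENAME == ARGV[1] {                 # read the replacement spec
            for (i = $1; i <= $2; i++) src[i] = $3 + (i - $1)
            next
        }
        FILENAME == ARGV[2] {                 # slurp file2 (memory permitting)
            f2[FNR] = $0
            next
        }
        {                                     # single pass over file1
            if (FNR in src) print f2[src[FNR]]
            else            print
        }
    ' ranges.txt file2 file1 > file1.new && mv file1.new file1

If file2 is too large to hold in memory, the same single pass over file1 still works as long as the source ranges are consumed in order; you'd read file2 incrementally with getline instead of slurping it into an array.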
Upvotes: 1