Reputation: 13
I'm working on a problem where I have to replace a certain range of lines in a huge text file with data from another (smaller, but still large) text file.
Suppose file1 has 10,000 lines and file2 has 3,000 lines. I need to perform operations of the type: extract lines 901-970 from file2 and insert them into lines 8701-8770 of file1, replacing what was there before. In the problem I'm actually working on, file1 has 61 million lines and file2 has 18 million. I need this operation to be efficient because it is performed many times; in the end, the whole content of file2 will end up somewhere inside file1.
The best solution I've got so far consists of splitting the two files into small files, each with the number of lines of the block being copied and replaced (70 in the example above). This proved much more efficient than a head and tail combination for extracting parts of the files, but it still requires touching parts of file1 that are not modified.
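For the example numbers above, the head/tail combination I mean would look something like this (just a sketch; file1.new is a temporary name I'm using for illustration):

    # replace lines 8701-8770 of file1 with lines 901-970 of file2
    head -n 8700 file1        >  file1.new    # everything before the block
    sed -n '901,970p' file2   >> file1.new    # the 70 replacement lines from file2
    tail -n +8771 file1       >> file1.new    # everything after the block
    mv file1.new file1

Every variant along these lines rewrites all of file1 just to change 70 lines.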
I was wondering if there is an awk/grep/sed solution to this. Extracting a part of file2 is not the problem, but I couldn't figure out how to replace a block of lines in file1 without loading the entire file.
Thanks!
Upvotes: 0
Views: 151
Reputation: 13
Following Jeff Y's suggestion, I used the dd command to do the replacements efficiently at the byte level. I first extract a block from file2 using:
dd if="file2" bs="$bperelem" skip="$start_copy" count=1 of="tmp2" 2> /dev/null
where bperelem is the size of the block in bytes and start_copy is the block's offset within file2, measured in units of bperelem (dd's skip= counts blocks of bs bytes, not individual bytes). Then I write this block into file1 using the following:
dd if="tmp2" bs="$bperelem" skip=0 count=1 seek="$start_replace" of="file1" conv=notrunc 2> /dev/null
For my specific problem, the variables start_copy and start_replace are updated inside a while loop.
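Put together, the loop looks roughly like this. This is only a sketch: it assumes fixed-width lines, so that every block really is exactly $bperelem bytes, and the variable num_blocks and the way the two offsets advance here are made up for illustration (my real update rule is different):

    # sketch: copy file2 into file1 one block at a time
    # assumes fixed-width lines, so each block is exactly $bperelem bytes
    start_copy=0          # block index inside file2
    start_replace=100     # block index inside file1 (made-up starting point)
    while [ "$start_copy" -lt "$num_blocks" ]; do
        # extract one block from file2 ...
        dd if="file2" bs="$bperelem" skip="$start_copy" count=1 of="tmp2" 2> /dev/null
        # ... and overwrite the corresponding block of file1 in place
        dd if="tmp2" bs="$bperelem" skip=0 count=1 seek="$start_replace" of="file1" conv=notrunc 2> /dev/null
        start_copy=$((start_copy + 1))
        start_replace=$((start_replace + 1))
    done

Note that seek= (like skip=) is measured in units of bs, so start_replace is a block index rather than a byte offset, and conv=notrunc is what stops dd from truncating file1 right after the block it writes.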
Upvotes: 1
Reputation: 2456
The problem is that you'd have to do a random-access type of operation (as distinct from sequential processing) to "avoid touching" the parts of file1 that don't change, and random access in files works at the character/byte level, not the line level. That is, if the number of bytes (as opposed to lines) being replaced in file1 were the same as the number of bytes coming from file2, you could do it (with fseek and the like). But it sounds like that is in no way guaranteed?
So you're going to have to do a single pass over file1 regardless, and the key will be optimizing the processing inside the loop over file1's lines. Consider processing all the pieces of file2 with one pass over file1, rather than running a separate operation on both files for every block?
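For example, here is a rough sketch of that idea. It assumes you can express all the replacements as a small spec file (I'll call it ranges.txt, one "dst_start dst_end src_start" triple per line, meaning: replace file1 lines dst_start..dst_end with file2 lines starting at src_start) and that file2 fits in memory:

    awk '
        FILENAME == ARGV[1] {                 # read the replacement spec
            for (i = $1; i <= $2; i++) src[i] = $3 + (i - $1)
            next
        }
        FILENAME == ARGV[2] {                 # slurp file2 (memory permitting)
            f2[FNR] = $0
            next
        }
        {                                     # single pass over file1
            if (FNR in src) print f2[src[FNR]]
            else            print
        }
    ' ranges.txt file2 file1 > file1.new && mv file1.new file1

If file2 is too large to hold in memory, the same single pass over file1 still works as long as the source ranges are consumed in order; you'd read file2 incrementally with getline instead of slurping it into an array.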
Upvotes: 1