o17t H1H' S'k

Reputation: 2745

Removing a large multi-line string from a very large file

I have a 10 GB text file, from which I want to find and delete a multi-line chunk. This chunk is given as another 10 MB text file, constituting a contiguous section that appears once in the large file and spans complete lines. Assuming I do not have enough memory to hold the whole 10 GB at once, what would be the easiest way to do this in some scripting language?

Example:

big.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

chunk.txt:

This chunk is given as another 10 MB text file,

constituting a contentious section appearing once in the large file and spanning complete lines.

result.txt:

...

I have a 10 GB text file, from which I want to find and delete a multi-line chunk.

Assuming I do not have enough memory to process the whole 10 GB in memory,

what would be the easiest way to do so in some scripting language?

...

Upvotes: 1

Views: 430

Answers (1)

o17t H1H' S'k

Reputation: 2745

Following a comment, I implemented a Python script that solves my issue using mmap. It also works under more general conditions:

  • does not require complete lines
  • deals with multiple non-overlapping matches
  • deals with multiple chunk files, processing them in order of decreasing file size
  • works with bytes
  • chunks can be very large themselves
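The reason for processing chunks from largest to smallest can be illustrated on a small in-memory buffer. This is only a sketch of the matching behavior, not the script itself: `delete_all` and `delete_chunks` are hypothetical helper names, and a `bytearray` stands in for the memory-mapped file. If a small chunk happens to be contained in a larger one, deleting the small one first prevents the larger one from ever matching.

```python
def delete_all(data: bytes, chunk: bytes) -> bytes:
    """Remove occurrences of chunk one at a time, always taking the last one."""
    if not chunk:
        return data
    buf = bytearray(data)
    while True:
        start = buf.rfind(chunk)  # last remaining occurrence, or -1
        if start == -1:
            break
        del buf[start:start + len(chunk)]
    return bytes(buf)


def delete_chunks(data: bytes, chunks, largest_first: bool = True) -> bytes:
    """Delete several chunks in order of size (largest first by default)."""
    for chunk in sorted(chunks, key=len, reverse=largest_first):
        data = delete_all(data, chunk)
    return data


data = b"AAA-xyz-BBB"
chunks = [b"xyz", b"-xyz-"]
print(delete_chunks(data, chunks))                       # b'AAABBB'
print(delete_chunks(data, chunks, largest_first=False))  # b'AAA--BBB'
```

Because the search restarts on the whole buffer after each deletion, a match that is newly formed by joining the bytes on either side of a deleted chunk would also be removed in this sketch.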

Code:

"""Usage: python3 delchunk.py BIGFILE CHUNK_FILE_OR_FOLDER [OUTFILE]
Given a large file BIGFILE, delete every complete, non-overlapping occurrence of the (possibly large) chunks given by CHUNK_FILE_OR_FOLDER.
Multiple chunks are deleted from the largest to the smallest.
If OUTFILE is not given, the result is saved to BIGFILE.delchunk.
"""


import mmap
import os
import shutil
import sys


# Show usage and exit when the required arguments are missing
if len(sys.argv) < 3:
    print(__doc__)
    sys.exit(1)
output = sys.argv[3] if len(sys.argv) > 3 else sys.argv[1] + '.delchunk'
# Work on a copy unless OUTFILE is BIGFILE itself (in-place edit)
if sys.argv[1] != output:
    shutil.copy(sys.argv[1], output)
if os.path.isdir(sys.argv[2]):
    # Collect the regular files in the folder, largest first
    paths = [os.path.join(sys.argv[2], name) for name in os.listdir(sys.argv[2])]
    chunks = sorted(filter(os.path.isfile, paths), key=os.path.getsize, reverse=True)
else:
    chunks = [sys.argv[2]]
with open(output, 'r+b') as bigfile, mmap.mmap(bigfile.fileno(), 0) as bigmap:
    for chunk in chunks:
        with open(chunk, 'rb') as chunkfile, mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
            i = 0  # number of matches deleted for this chunk
            while True:
                # Find the last remaining occurrence of the chunk
                start = bigmap.rfind(chunkmap)
                if start == -1:
                    break
                i += 1
                end = start + len(chunkmap)
                print('Deleting chunk %s (%d) at %d:%d' % (chunk, i, start, end))
                # Shift the tail left over the match, then shrink the map and the file
                bigmap.move(start, end, len(bigmap) - end)
                bigmap.resize(len(bigmap) - len(chunkmap))
            if not i:
                print('Chunk %s not found' % chunk)
            else:
                bigmap.flush()

https://gist.github.com/eyaler/971efea29648af023e21902b9fa56f08
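The script above relies on mmap's resize(), which the Python documentation lists as unavailable on some platforms. As a possible alternative, here is a minimal sketch that maps both files read-only, records the match offsets, and writes everything except the matches to a new file in fixed-size blocks. The function names and the `block_size` parameter are made up for illustration. The sketch assumes both files are non-empty (an empty file cannot be mapped) and that `out_path` differs from `big_path`. It scans left to right for a single chunk file, so it is a variation on the approach above rather than a drop-in replacement.

```python
import mmap


def find_matches(bigmap, chunk):
    """Return start offsets of non-overlapping matches, scanning left to right."""
    offsets = []
    pos = bigmap.find(chunk)
    while pos != -1:
        offsets.append(pos)
        pos = bigmap.find(chunk, pos + len(chunk))  # continue after this match
    return offsets


def copy_range(bigmap, out, start, end, block_size):
    """Write bigmap[start:end] to out, at most block_size bytes at a time."""
    for pos in range(start, end, block_size):
        out.write(bigmap[pos:min(pos + block_size, end)])


def copy_without_chunk(big_path, chunk_path, out_path, block_size=1 << 20):
    """Copy big_path to out_path, skipping every match of chunk_path's contents.

    Returns the number of matches skipped. out_path must not be big_path,
    because opening it for writing would truncate the input.
    """
    with open(big_path, 'rb') as big, open(chunk_path, 'rb') as chunkfile:
        with mmap.mmap(big.fileno(), 0, access=mmap.ACCESS_READ) as bigmap, \
             mmap.mmap(chunkfile.fileno(), 0, access=mmap.ACCESS_READ) as chunkmap:
            offsets = find_matches(bigmap, chunkmap)
            with open(out_path, 'wb') as out:
                prev = 0  # end of the previous match (start of the next kept range)
                for start in offsets:
                    copy_range(bigmap, out, prev, start, block_size)
                    prev = start + len(chunkmap)
                copy_range(bigmap, out, prev, len(bigmap), block_size)
    return len(offsets)
```

For the files in the question, this could be called as `copy_without_chunk('big.txt', 'chunk.txt', 'result.txt')`. Only one block of the output is held in Python memory at a time; the operating system pages the mapped input in and out as needed.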

Upvotes: 1
