Performing a RegEx substitution on blocks of text in a Binary file

Question

I have a file that contains both text and binary code. In order for Python to process it, I have to load it as a binary file, which makes sense.

Now, the problem is, once I do that, I can't use a regular RegEx on it without some changes I don't currently understand.

I was hoping the code would be as simple as the following, but it's proving not to be.

#!/usr/bin/env python

import re

s = open('./source.data', 'rb')
d = open('./dest.data', 'wb')

f = "REPEATED_TEXT_STRING"

c = s.read()

r = "^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?" + f + "[\s\S]+?^endobj$"

r = re.compile(r, re.DOTALL | re.MULTILINE)
t = r.sub('', c)

d.write(t)

I do know that the r variable needs to be marked as a bytes string, with a 'b' prefix, but unfortunately it doesn't seem to be as simple as that for what I'm trying to do.

Dan D. · Accepted Answer

The re module documentation states:

Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
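As a small illustration of that rule, the sketch below applies a str pattern and then a bytes pattern to the same bytes value. The sample data and the variable names are made up for the demonstration and are not taken from the question.

```python
import re

# Bytes, as returned by reading a file opened in 'rb' mode (made-up sample).
data = b"12 0 obj\nendobj\n"

# A str pattern applied to a bytes object raises TypeError.
try:
    re.sub(r"\d+", "", data)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# With a bytes pattern and a bytes replacement, the same call succeeds.
cleaned = re.sub(rb"\d+", b"", data)
```

Here mixed_ok ends up False, while the all-bytes call runs and strips the digits.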

This implies that if c is a bytes object, then r and the replacement string must be bytes objects as well:

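f = b"REPEATED_TEXT_STRING"

c = s.read()

# Raw bytes literals (rb"...") keep \d and \s as regex escapes, not string escapes.
r = rb"^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?" + f + rb"[\s\S]+?^endobj$"

r = re.compile(r, re.DOTALL | re.MULTILINE)
t = r.sub(b'', c)  # the replacement must be bytes too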

Note that f and both halves of r need to be bytes as well, not just the replacement string.
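For a quick check without touching real files, the same substitution can be tried on an in-memory bytes value. This is only a sketch: the function name strip_objects and the sample data are hypothetical, and wrapping the marker in re.escape is an added precaution that is not part of the answer above.

```python
import re


def strip_objects(data: bytes, marker: bytes) -> bytes:
    """Remove every 'N N obj ... endobj' block whose body contains marker."""
    pattern = re.compile(
        rb"^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?"
        + re.escape(marker)  # assumption: marker is literal text, not a regex
        + rb"[\s\S]+?^endobj$",
        re.DOTALL | re.MULTILINE,
    )
    return pattern.sub(b"", data)


# Made-up sample input mixing text lines with non-text bytes.
sample = (
    b"1 0 obj\nkeep me\nendobj\n"
    b"2 0 obj\nREPEATED_TEXT_STRING\n\x00\xff\nendobj\n"
    b"3 0 obj\nkeep too\nendobj\n"
)

result = strip_objects(sample, b"REPEATED_TEXT_STRING")
```

With this sample, only the second object is removed and the first and third are left in place. One caveat: with re.MULTILINE, $ matches only just before a b"\n", so data that uses \r\n line endings would probably need the pattern adjusted.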
