Reputation: 65
I have a file that contains both text and binary code. In order for Python to process it, I have to load it as a binary file, which makes sense.
Now, the problem is, once I do that, I can't use a regular RegEx on it without some changes I don't currently understand.
I was hoping the code would be as simple as the following, but it's proving not to be.
#!/usr/bin/env python
import re
s = open('./source.data', 'rb')
d = open('./dest.data', 'wb')
f = "REPEATED_TEXT_STRING"
c = s.read()
r = "^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?" + f + "[\s\S]+?^endobj$"
r = re.compile(r, re.DOTALL | re.MULTILINE)
t = r.sub('', c)
d.write(t)
I do know that the r variable needs to be marked as a bytes string, with a 'b' prefix, but unfortunately it doesn't seem to be as simple as that for what I'm trying to do.
Upvotes: 1
Views: 106
Reputation: 74655
The re module documentation states:
Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
Which implies that if c is a bytes object, then r and the substitution string must be bytes objects as well:
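f = b"REPEATED_TEXT_STRING"
c = s.read()  # bytes, because the file was opened in 'rb' mode
# Raw bytes literals (rb"...") avoid invalid-escape warnings for \d and \s.
r = rb"^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?" + f + rb"[\s\S]+?^endobj$"
r = re.compile(r, re.DOTALL | re.MULTILINE)
t = r.sub(b'', c)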
I forgot about f and the other half of r. They need to be bytes also.
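To see the all-bytes version working end to end, here is a minimal sketch on in-memory data. The helper name strip_objects and the sample bytes are made up for illustration, and wrapping the marker in re.escape is an extra precaution that is not in the snippet above (it only matters if the marker contains regex metacharacters).

```python
import re


def strip_objects(data: bytes, marker: bytes) -> bytes:
    """Remove each 'N N obj ... endobj' block whose body contains marker.

    Hypothetical helper: pattern, marker and replacement are all bytes,
    so they can be used against data read from a file opened in 'rb' mode.
    """
    pattern = re.compile(
        rb"^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?"
        + re.escape(marker)  # assumption: treat the marker as literal text
        + rb"[\s\S]+?^endobj$",
        re.MULTILINE,
    )
    return pattern.sub(b"", data)


# Made-up sample: two objects, the second one contains the marker
# along with some non-text bytes.
sample = (
    b"1 0 obj\nkeep me \xff\xfe\nendobj\n"
    b"2 0 obj\nREPEATED_TEXT_STRING \x00\x01\nendobj\n"
)
result = strip_objects(sample, b"REPEATED_TEXT_STRING")
print(result)
```

For the files in the question, the same call would take the bytes read from ./source.data and its result would be written to ./dest.data opened in 'wb' mode. One thing to watch for: with a bytes pattern, $ under re.MULTILINE only matches before b"\n", so a file with \r\n line endings would probably need the pattern adjusted.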
Upvotes: 1