Reputation: 2452
I have a text file that contains some binary data. When I read the file in text mode with Python 3, I get a UnicodeDecodeError (codec can't decode byte...) with the following lines of code:
fo = open('myfile.txt', 'r')
for line in fo:
How can I remove the binary data from my file? A header is printed just before each block of binary data (in this case it is shown as Data Block). For example, my file looks like this, and I want to remove the çºí?¼Èדñdí:
myfile.txt:
ABCDEFGH
123456
Data Block 11
çºí?¼Èדñdí
XYZ123
The result I want is for myfile.txt to look like this:
ABCDEFGH
123456
Data Block 11
XYZ123
Upvotes: 4
Views: 5495
Reputation: 30250
This is difficult, because "binary" blobs may contain valid characters or character sequences. And if you're using a file that has "text" using multi-byte encoding, forget about it.
If you know the "text" in your file only contains single-byte characters, one approach would be to read the file in as bytes, then use something like
decode('ascii', errors='ignore')
This effectively strips non-ascii characters out of the output, but if you were to do this on your file, you'd get:
ABCDEFGH
123456
Data Block
?d
XYZ123
Note the second to last line -- valid ascii characters were found in the blob and treated as "text".
You may start with a solution like that and fine-tune it (if possible) to meet your needs. Maybe the blobs occur on lines by themselves, so that if a line has any non-ASCII characters you can throw out the entire line. Maybe you can look at the blobs and try to grok some structure they have. Maybe you just settle for having random lines of partial characters in there and handle them somehow later. It's kind of application-specific at that point.
Here's the code I used to produce that output from your sample input:
def strip_nonascii(b):
    return b.decode('ascii', errors='ignore')

with open('garbled.txt', 'rb') as f:
    for line in f:
        print(strip_nonascii(line), end='')
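The "throw out the entire line" idea mentioned above could be sketched roughly like this. It is only a sketch, assuming the blobs sit on lines of their own and every genuine text line is pure ASCII; the helper name drop_nonascii_lines and the sample bytes are made up for illustration and are not the real contents of the asker's file.

```python
def drop_nonascii_lines(data: bytes) -> bytes:
    """Keep only the lines that consist entirely of ASCII bytes."""
    kept = []
    for line in data.splitlines(keepends=True):
        # bytes.isascii() (Python 3.7+) is True only if every byte is < 0x80
        if line.isascii():
            kept.append(line)
    return b''.join(kept)


# Illustrative stand-in for the file contents (the blob bytes are invented)
sample = (b'ABCDEFGH\n123456\nData Block 11\n'
          b'\xe7\xba\xed?\xbc\xc8d\xed\n'
          b'XYZ123\n')
cleaned = drop_nonascii_lines(sample)
print(cleaned.decode('ascii'), end='')
```

To apply it to a real file you would read it with open(..., 'rb') and write the result back with 'wb'. Be aware that a blob containing a newline byte gets split into several "lines", and any fragment that happens to be pure ASCII would survive.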
Upvotes: 7
Reputation: 46
If you also have a footer after the binary data (just as you have a header before it), you could try using a regular expression to replace everything between the header and the footer with nothing.
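A minimal sketch of that idea, working on bytes so that nothing has to be decoded first. The header pattern follows the sample in the question; the footer text "End Block", the helper name strip_blocks, and the sample bytes are assumptions made up for illustration, since the question does not show a footer.

```python
import re

# Hypothetical markers: the header matches the question's sample, while the
# footer "End Block" is invented here purely for illustration.
BLOCK = re.compile(rb'(Data Block \d+\r?\n).*?(End Block)', re.DOTALL)


def strip_blocks(data: bytes) -> bytes:
    """Remove everything between each header line and the next footer."""
    # Keep the header (group 1) and footer (group 2), drop what lies between
    return BLOCK.sub(rb'\1\2', data)


# Illustrative input; note the blob itself contains a newline byte
sample = (b'ABCDEFGH\nData Block 11\n'
          b'\xe7\xba\n\xed?d\xed\n'
          b'End Block\nXYZ123\n')
result = strip_blocks(sample)
```

Because re.DOTALL lets . match newline bytes, this also copes with blobs that span several lines. It would still cut a blob short if the footer bytes happened to occur inside the binary data itself.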
Upvotes: -1