Reputation: 9442
Is it possible to encode the output of a grep command in UTF-8 no matter what the encoding of the input file was?
I execute a grep statement in a python script (subprocess) and I want to guarantee the resulting bytes are UTF-8.
Example:
grep -P "ÄA" -m -1 file.txt
I don't know the input encoding of the file.
Upvotes: 0
Views: 4218
Reputation: 1124828
Grep follows the UNIX philosophy, that is, it does one thing, and it does this one thing well. File encoding is not part of that one thing.
That's what other tools are for. There is another tool that does character decoding and encoding well, called iconv. Use that to change the encoding of the input file to UTF-8.
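Since the question runs grep from a Python script, here is a minimal sketch of the same convert-then-search idea done in pure Python. It is roughly what `iconv -f ISO-8859-1 -t UTF-8 file.txt | grep ...` would do in a shell. The function names `to_utf8` and `grep_utf8` are hypothetical, the source encoding is assumed to be known, and Python's `re` module is not identical to grep's `-P` (PCRE) syntax.

```python
import re
from typing import List


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    # Raises UnicodeDecodeError if the bytes are invalid in source_encoding.
    return raw.decode(source_encoding).encode("utf-8")


def grep_utf8(raw: bytes, pattern: str, source_encoding: str) -> List[bytes]:
    """Return the lines matching `pattern`, each encoded as UTF-8."""
    # Decode first, so the pattern is matched against text, not raw bytes.
    text = raw.decode(source_encoding)
    regex = re.compile(pattern)
    return [
        line.encode("utf-8")
        for line in text.splitlines()
        if regex.search(line)
    ]
```

Matching on decoded text means the pattern `ÄA` is compared as characters, so the result does not depend on how the pattern itself happened to be encoded on the command line.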
This does require you to know the input file encoding. If you don't know it, you have to guess, based on heuristic analysis of the input file. It'll be hard to be certain: recognising that something has been decoded using the wrong codec usually requires a human to verify the result. There is a tool for that too, called enca. This tool can also do the conversion once a guess has been made. It is usually a separate install (it is not part of the common default POSIX toolset). See "How to auto detect text file encoding?" over on Super User for more options.
Note, however, that because codec-guessing tools rely on statistical analysis, it is better to do the guessing on the whole input file, not on the output of grep: a few matching lines give the heuristic far less data to work with.
None of this has anything to do with Python, of course, unless you want to do the encoding detection in Python instead, in which case you'd want to look at the chardet library.
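A minimal sketch of what detection in Python might look like, assuming the third-party chardet package is installed (`pip install chardet`). The helper names `guess_encoding` and `bytes_to_utf8` and the ISO-8859-1 fallback are hypothetical choices for illustration; trying strict UTF-8 before asking chardet is also an assumption of this sketch, not something chardet requires. The result is still only a guess.

```python
def guess_encoding(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess the encoding of `raw`. Heuristic, so the answer may be wrong."""
    try:
        # Valid UTF-8 (which includes plain ASCII) needs no statistical guess.
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    import chardet  # third-party; only needed when the input is not UTF-8

    # chardet.detect returns a dict with 'encoding' and 'confidence' keys;
    # 'encoding' can be None when no guess could be made.
    guess = chardet.detect(raw)
    return guess["encoding"] or fallback


def bytes_to_utf8(raw: bytes) -> bytes:
    """Decode with the guessed encoding and re-encode as UTF-8."""
    # errors="replace" keeps going if the guess turns out to be imperfect.
    return raw.decode(guess_encoding(raw), errors="replace").encode("utf-8")
```

In line with the advice above, you would read the whole input file in binary mode, pass its full contents to `guess_encoding`, and only then search the decoded text.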
Upvotes: 3