user3535074
user3535074

Reputation: 1338

subprocess stdout string decoding not working

I'm using the following subprocess call to use a command line tool. The output of the command line tool isn't printed in one go, it prints immediately on the command line, it generates over multiple lines over a period of time. The tool is bs1770gain and the command would be "path\to\bs1770gain.exe" "-i" "\path\to\audiofile.wav", By using the --loglevel parameter you can include more data but you cannot remove the progressive results being written to stdout.

I need stdout to return a human readable string (hence the stdout_formatted operation):

with subprocess.Popen(list_of_args, stdout=subprocess.PIPE,  stderr=subprocess.PIPE) as proc:
    stdout, stderr = proc.communicate()
    stdout_formatted = stdout.decode('UTF-8')
    stderr_formatted = stderr.decode('UTF-8')

However I can only view the variable as a human readable string if I print it e.g.

In [23]: print(stdout_formatted )
      nalyzing ...   [1/2] "filename.wav": 
          integrated:  -2.73 LUFS / -20.27 LU   [2/2] 
      "filename2.wav":         
          integrated:  -4.47 LUFS / -18.53 LU   
      [ALBUM]:
          integrated:  -3.52 LUFS / -19.48 LU done.

In [24]: stdout_formatted 
Out[24]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\.......

In [6]: stdout
Out[6]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\......

In [4]: type(stdout)
Out[4]: bytes

In [5]: type(stdout_formatted)
Out[5]: str

If you look carefully, the human readable chars are in the string (the first word is "analyzing"

I guessed that the stdout value needs decoding/encoding so I tried different ways:

stdout_formatted.encode("ascii")
Out[18]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g

stdout_formatted.encode("utf-8")
Out[17]: b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

stdout.decode("utf-8")
Out[15]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

stdout.decode("ascii")
Out[14]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

bytes(stdout).decode("ascii")
Out[13]: 'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00g\

I used a library called chardet to check the encoding of stdout:

import chardet

chardet.detect(stdout)
Out[26]: {'confidence': 1.0, 'encoding': 'ascii', 'language': ''}

I'm working on Windows 10 and have am using python 3.6 (the anaconda package and it's integrated Spyder IDE).

I'm kind of clutching at straws now - is it possible to capture what is displayed in the console when print is called in a variable or remove the unwanted bytecode in the stdout string?

Upvotes: 1

Views: 15126

Answers (3)

fred727
fred727

Reputation: 2819

I had the same problem with the progress information added by bs1770gain. It is not said in the man page but there is a "--suppress-progress" flag !

So I just use this command line and everythig were OK ;)

bs1770gain --xml --suppress-progress "$filename"

Upvotes: 0

Martijn Pieters
Martijn Pieters

Reputation: 1122092

You don't have UTF-8 data. You have UTF-16 data. UTF-16 uses two bytes for every character; characters in the ASCII and Latin-1 ranges (such as a), still use 2 bytes, but one of those bytes is always a \x00 NUL byte.

Because UTF-16 always uses 2 bytes for every character, their order starts to matter. Encoders can pick between the two options; one is called Little Endian, the other Big Endian. Normally, encoders then include a Byte Order Mark at the very start, so that the decoder knows which of the two order options to use when decoding.

Your posted data doesn't appear to include the BOM (I don't see the 0xFF and 0xFE bytes, but your data does look like it is using little-endian ordering. That fits with this being Windows; Windows always uses little-endian ordering for it's UTF-16 output.

If your data does have the BOM present, you can just decode as 'utf-16'. If the BOM is missing, use 'utf-16-le':

>>> sample = b'a\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> sample.decode('utf-16-le')
'analyzin'
>>> import codecs
>>> (codecs.BOM_UTF16_LE + sample)
b'\xff\xfea\x00n\x00a\x00l\x00y\x00z\x00i\x00n\x00'
>>> (codecs.BOM_UTF16_LE + sample).decode('utf-16')
'analyzin'

Upvotes: 9

Jean Du Plessis
Jean Du Plessis

Reputation: 147

Have you tried :

 str(stdout_formatted).replace('\x00','')

?

Upvotes: -3

Related Questions