Peter Till

Reputation: 109

How can I decrypt voice data received from Discord?

I'm trying to decrypt the voice data I receive from Discord, which uses the xsalsa20_poly1305 encryption mode. My goal is to record the audio and use it to chat with an AI.

What am I doing wrong? Thanks for any help!

My code:

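import select
import socket

import nacl.secret

async def record_audio(udp_socket, ssrc, secret_key):
    # secret_key arrives as a list of ints, so convert it to raw bytes for SecretBox
    box = nacl.secret.SecretBox(bytes(secret_key)) # TODO: Fix decryption xsalsa20_poly1305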
    print("Listening for audio data...")

    try:
        response, _ = udp_socket.recvfrom(74)
        print(f"Received response: {response}")
        # Process the response...
    except socket.timeout:
        print("IP discovery timeout")
    except Exception as e:
        print(f"Unexpected error during IP discovery: {e}")
        return None, None

    while True:
        print("Waiting for audio data...")
        try:
            ready, _, _ = select.select([udp_socket], [], [], 5.0)
            if udp_socket in ready:
                data, addr = udp_socket.recvfrom(65536)  # Adjust buffer size as necessary
                print(f"Received {len(data)} bytes from {addr}: {data.hex()}")

                if len(data) > 12:

                    # Extract the RTP header
                    header = data[:12]

                    # Construct the nonce
                    nonce = header + b'\x00' * 12

                    print(f"Nonce: {len(nonce)} bytes")

                    # Get the encrypted audio data
                    encrypted = data[12:]

                    print(f"Encrypted audio data: {len(encrypted)} bytes")


                    # The rest of the data is the encrypted audio data (48 - 12 = 36 bytes for a 48-byte packet)


                    #nonce  = data[:12]

                    #print(f"Nonce: {nonce}")

                    #if len(nonce) < 12:
                    #    nonce.ljust(24, b'\x00')
                    #remaining 12 bytes can be zeros or another fixed pattern
                    #nonce = nonce_part + bytes(12)
                    #copy the RTP header to get the nonce
                    #nonce = bytearray(24)
                    #nonce[:12] = data[:12]#data[:12]

                    #get the encrypted audio data
                    #encrypted = data[12:]
                    print(f"Encrypted audio data: {bytes(encrypted)}")
                    try:
                        # Decrypt only the payload after the 12-byte header, not the whole packet
                        audio_data = box.decrypt(bytes(encrypted), bytes(nonce))
                        print("Received audio data")
                    except Exception as e:
                        print(f"Decryption error: {e}")
        except Exception as e:
            print(f"Error receiving audio data: {e}")
            break

Edit: The secret key is passed directly from the opcode 4 object received from Discord.
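
For reference, here is a minimal sketch of how that list can be turned into key bytes. The key is the one from the test data further down; `key_from_session_description` is a hypothetical helper name, and the length check assumes the 32-byte key size that PyNaCl's `SecretBox` expects.

```python
def key_from_session_description(payload: dict) -> bytes:
    """Convert the opcode 4 'secret_key' list of ints into raw key bytes."""
    key = bytes(payload["d"]["secret_key"])
    if len(key) != 32:  # nacl.secret.SecretBox.KEY_SIZE is 32
        raise ValueError(f"expected a 32-byte key, got {len(key)} bytes")
    return key


# Trimmed copy of the opcode 4 payload from the test data
payload = {
    "op": 4,
    "d": {
        "mode": "xsalsa20_poly1305",
        "secret_key": [20, 115, 239, 10, 206, 186, 11, 248, 52, 47, 193, 69,
                       170, 89, 146, 187, 215, 181, 4, 177, 173, 132, 50, 212,
                       141, 194, 52, 217, 219, 17, 111, 5],
    },
}

key = key_from_session_description(payload)
print(len(key), key.hex())
```

The resulting `key` is what would be passed to `nacl.secret.SecretBox(key)`.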

Output:

Waiting for audio data...
Received 48 bytes from ('66.22.243.22', 50023): 81c9000700013adfaf6439133ca81bfcd2b35eb743f2a4af0165e3cf0517d8efee5dae36ec6653c88a2d625064af33d6
Nonce: 24 bytes
Nonce: b'\x81\xc9\x00\x07\x00\x01:\xdf\xafd9\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Encrypted audio data: 36 bytes
Encrypted audio data: b'<\xa8\x1b\xfc\xd2\xb3^\xb7C\xf2\xa4\xaf\x01e\xe3\xcf\x05\x17\xd8\xef\xee]\xae6\xecfS\xc8\x8a-bPd\xaf3\xd6'
Decryption error: Decryption failed. Ciphertext failed verification

I can't include my Discord token for security reasons, but here is some test data:

{'op': 4, 'd': {'video_codec': 'H264', 'secure_frames_version': 0, 'secret_key': [20, 115, 239, 10, 206, 186, 11, 248, 52, 47, 193, 69, 170, 89, 146, 187, 215, 181, 4, 177, 173, 132, 50, 212, 141, 194, 52, 217, 219, 17, 111, 5], 'mode': 'xsalsa20_poly1305', 'media_session_id': '77d90ef5c4aa124c0dcd6d39bbe88f9f', 'audio_codec': 'opus'}}
Udp socket:  <socket.socket fd=604, family=2, type=2, proto=0, laddr=('0.0.0.0', 56866)>
SSRC:  112825
Secret key:  [20, 115, 239, 10, 206, 186, 11, 248, 52, 47, 193, 69, 170, 89, 146, 187, 215, 181, 4, 177, 173, 132, 50, 212, 141, 194, 52, 217, 219, 17, 111, 5]
Received data: b'\x81\xc9\x00\x07\x00\x00gIZ.Y\xaf\xf8\x94\xb4a}?gm"\xc6R\x02\\\x13\xaf>@\xf0\xe8\xca\xd0\x90\xf3\x16\x89h\x14\x81s\xa0\x00\xf3$v\x99|'

Everything as byte arrays:

Data: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19, 161, 42, 79, 112, 160, 142, 72, 43, 68, 43, 225, 201, 66, 97, 38, 88, 120, 123, 192, 102, 18, 163, 126, 210, 96, 21, 113, 212, 66, 63, 102, 7, 123, 24, 141, 1]
RTP header: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19]
Nonce: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Voice data: [161, 42, 79, 112, 160, 142, 72, 43, 68, 43, 225, 201, 66, 97, 38, 88, 120, 123, 192, 102, 18, 163, 126, 210, 96, 21, 113, 212, 66, 63, 102, 7, 123, 24, 141, 1]
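
For what it's worth, the first 12 bytes of that sample can be unpacked with `struct`. This is only a sketch, assuming the standard RTP/RTCP header layouts from RFC 3550; `describe_header` is a hypothetical helper, not part of any library. Under that layout the second byte, 201, falls in the RTCP packet-type range (200 to 204) rather than being an audio payload type, which may be why this particular packet does not decrypt as voice.

```python
import struct

RTCP_PACKET_TYPES = range(200, 205)  # SR, RR, SDES, BYE, APP in RFC 3550


def describe_header(packet: bytes) -> dict:
    """Decode the fixed header fields of an RTP or RTCP packet."""
    version = packet[0] >> 6
    second = packet[1]
    if second in RTCP_PACKET_TYPES:
        # RTCP: 16-bit length (in 32-bit words minus one), then the sender SSRC
        length_words, sender_ssrc = struct.unpack_from(">HI", packet, 2)
        return {"kind": "rtcp", "version": version, "packet_type": second,
                "length_words": length_words, "sender_ssrc": sender_ssrc}
    # RTP: 16-bit sequence number, 32-bit timestamp, 32-bit SSRC
    sequence, timestamp, ssrc = struct.unpack_from(">HII", packet, 2)
    return {"kind": "rtp", "version": version, "payload_type": second & 0x7F,
            "sequence": sequence, "timestamp": timestamp, "ssrc": ssrc}


# The "RTP header" bytes from the sample above
sample = bytes([129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19])
info = describe_header(sample)
print(info)
```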

Upvotes: 0

Views: 172

Answers (1)

Peter Till

Reputation: 109

So I found the solution. When no one is in the voice channel, Discord seemingly sends a lot of rubbish data, followed by five packets of silence (0xF8, 0xFF, 0xFE), and then rubbish, undecryptable data again. To receive bytes that actually decrypt with XSalsa20_Poly1305, someone has to be in the voice channel and must speak or make some sound. I initially didn't test while someone was in voice because I assumed silence would arrive in the same format as voice, so that if I could decrypt it, I'd be good to go...
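
As a rough sketch of how a receive loop could skip the undecryptable packets instead of feeding everything to the decryptor: the function below only attempts decryption when the header looks like a voice packet. The payload type 0x78 is an assumption about what Discord uses for Opus audio, `decrypt_voice_packet` is a hypothetical helper, and `box` is assumed to behave like `nacl.secret.SecretBox` (a `decrypt(ciphertext, nonce)` method).

```python
OPUS_PAYLOAD_TYPE = 0x78  # assumption: payload type Discord uses for Opus voice
RTP_HEADER_SIZE = 12


def decrypt_voice_packet(box, packet: bytes):
    """Return the decrypted payload, or None if the packet is not voice data.

    `box` is assumed to expose decrypt(ciphertext, nonce), like
    nacl.secret.SecretBox does.
    """
    if len(packet) <= RTP_HEADER_SIZE:
        return None
    version = packet[0] >> 6
    payload_type = packet[1] & 0x7F
    if version != 2 or payload_type != OPUS_PAYLOAD_TYPE:
        # e.g. packets whose second byte is 0xc9, like the samples in the question
        return None
    header = packet[:RTP_HEADER_SIZE]
    # xsalsa20_poly1305 mode: 24-byte nonce = 12-byte header + 12 zero bytes
    nonce = header + b"\x00" * 12
    return box.decrypt(packet[RTP_HEADER_SIZE:], nonce)


# Usage inside the receive loop (requires PyNaCl):
#   box = nacl.secret.SecretBox(bytes(secret_key))
#   audio = decrypt_voice_packet(box, data)
#   if audio is not None:
#       ...  # hand the Opus frame to a decoder
```

Decryption can still raise for a malformed voice packet, so the call is best kept inside the existing try/except.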

By the way, thanks Topaco for pointing out some key issues with my implementation and testing stuff as well!

Upvotes: 0
