Reputation: 109
I'm trying to decrypt the voice data received from Discord, which uses the xsalsa20_poly1305
encryption mode. My goal is to record audio and use it to chat with an AI.
What am I doing wrong? Thanks for any help!
My code:
import select
import socket

import nacl.secret


async def record_audio(udp_socket, ssrc, secret_key):
    box = nacl.secret.SecretBox(bytes(secret_key))  # TODO: Fix decryption xsalsa20_poly1305
    print("Listening for audio data...")
    try:
        response, _ = udp_socket.recvfrom(74)
        print(f"Received response: {response}")
        # Process the response...
    except socket.timeout:
        print("IP discovery timeout")
    except Exception as e:
        print(f"Unexpected error during IP discovery: {e}")
        return None, None
    while True:
        print("Waiting for audio data...")
        try:
            ready, _, _ = select.select([udp_socket], [], [], 5.0)
            if udp_socket in ready:
                data, addr = udp_socket.recvfrom(65536)  # Adjust buffer size as necessary
                print(f"Received {len(data)} bytes from {addr}: {data.hex()}")
                if len(data) > 12:
                    # Extract the RTP header
                    header = data[:12]
                    # Construct the nonce: 12-byte header padded with zeros to 24 bytes
                    nonce = header + b'\x00' * 12
                    print(f"Nonce: {len(nonce)} bytes")
                    # Get the encrypted audio data
                    encrypted = data[12:]
                    print(f"Encrypted audio data: {len(encrypted)} bytes")
                    # The rest of the data is the encrypted audio data
                    # (48 - 12 = 36 bytes for a 48-byte packet)
                    # nonce = data[:12]
                    # print(f"Nonce: {nonce}")
                    # if len(nonce) < 12:
                    #     nonce.ljust(24, b'\x00')
                    # remaining 12 bytes can be zeros or another fixed pattern
                    # nonce = nonce_part + bytes(12)
                    # copy the RTP header to get the nonce
                    # nonce = bytearray(24)
                    # nonce[:12] = data[:12]
                    # get the encrypted audio data
                    # encrypted = data[12:]
                    print(f"Encrypted audio data: {bytes(encrypted)}")
                    try:
                        # Pass only the payload; the 12-byte header is not ciphertext
                        audio_data = box.decrypt(bytes(encrypted), bytes(nonce))
                        print("Received audio data")
                    except Exception as e:
                        print(f"Decryption error: {e}")
        except Exception as e:
            print(f"Error receiving audio data: {e}")
            break
Edit: The secret key is passed directly from the opcode 4 object received from Discord.
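For reference, the nonce and ciphertext split that the loop above attempts can be isolated into a small helper. This is only a sketch: the name `split_voice_packet` is hypothetical, and it assumes a plain 12-byte RTP header (no CSRC entries or header extension) with the nonce formed by zero-padding that header to 24 bytes, as the question's code does.

```python
RTP_HEADER_LEN = 12  # assumed fixed header size (no CSRC list, no extension)
NONCE_LEN = 24       # nonce size expected by nacl.secret.SecretBox


def split_voice_packet(packet: bytes):
    """Return (nonce, ciphertext) for one xsalsa20_poly1305 voice packet."""
    if len(packet) <= RTP_HEADER_LEN:
        raise ValueError("packet too short to hold an RTP header and a payload")
    header = packet[:RTP_HEADER_LEN]
    # The nonce is the RTP header right-padded with zero bytes to 24 bytes.
    nonce = header.ljust(NONCE_LEN, b"\x00")
    # Everything after the header is ciphertext (authentication tag plus data).
    ciphertext = packet[RTP_HEADER_LEN:]
    return nonce, ciphertext
```

With PyNaCl, the result would then be passed as `box.decrypt(ciphertext, nonce)`, so that the header bytes are never treated as part of the ciphertext.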
Output:
Waiting for audio data...
Received 48 bytes from ('66.22.243.22', 50023): 81c9000700013adfaf6439133ca81bfcd2b35eb743f2a4af0165e3cf0517d8efee5dae36ec6653c88a2d625064af33d6
Nonce: 24 bytes
Nonce: b'\x81\xc9\x00\x07\x00\x01:\xdf\xafd9\x13\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
Encrypted audio data: 36 bytes
Encrypted audio data: b'<\xa8\x1b\xfc\xd2\xb3^\xb7C\xf2\xa4\xaf\x01e\xe3\xcf\x05\x17\xd8\xef\xee]\xae6\xecfS\xc8\x8a-bPd\xaf3\xd6'
Decryption error: Decryption failed. Ciphertext failed verification
I can't include my Discord token for security reasons, but here is some test data:
{'op': 4, 'd': {'video_codec': 'H264', 'secure_frames_version': 0, 'secret_key': [20, 115, 239, 10, 206, 186, 11, 248, 52, 47, 193, 69, 170, 89, 146, 187, 215, 181, 4, 177, 173, 132, 50, 212, 141, 194, 52, 217, 219, 17, 111, 5], 'mode': 'xsalsa20_poly1305', 'media_session_id': '77d90ef5c4aa124c0dcd6d39bbe88f9f', 'audio_codec': 'opus'}}
Udp socket: <socket.socket fd=604, family=2, type=2, proto=0, laddr=('0.0.0.0', 56866)>
SSRC: 112825
Secret key: [20, 115, 239, 10, 206, 186, 11, 248, 52, 47, 193, 69, 170, 89, 146, 187, 215, 181, 4, 177, 173, 132, 50, 212, 141, 194, 52, 217, 219, 17, 111, 5]
Received data: b'\x81\xc9\x00\x07\x00\x00gIZ.Y\xaf\xf8\x94\xb4a}?gm"\xc6R\x02\\\x13\xaf>@\xf0\xe8\xca\xd0\x90\xf3\x16\x89h\x14\x81s\xa0\x00\xf3$v\x99|'
Everything as byte arrays:
Data: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19, 161, 42, 79, 112, 160, 142, 72, 43, 68, 43, 225, 201, 66, 97, 38, 88, 120, 123, 192, 102, 18, 163, 126, 210, 96, 21, 113, 212, 66, 63, 102, 7, 123, 24, 141, 1]
RTP header: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19]
Nonce: [129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Voice data: [161, 42, 79, 112, 160, 142, 72, 43, 68, 43, 225, 201, 66, 97, 38, 88, 120, 123, 192, 102, 18, 163, 126, 210, 96, 21, 113, 212, 66, 63, 102, 7, 123, 24, 141, 1]
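One thing worth checking in these samples is the second byte, which is 201 (0xC9) in every packet shown. RFC 3550 assigns the values 200 to 204 to RTCP packet types (201 is a receiver report), whereas Discord's voice documentation describes audio packets as starting with 0x80 0x78. The sketch below decodes the first four bytes of the sample header; `describe_header` is a hypothetical helper, and reading these packets as RTCP is an assumption that the thread itself does not confirm.

```python
import struct

# RTCP packet types defined in RFC 3550
RTCP_PACKET_TYPES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}


def describe_header(packet: bytes) -> dict:
    """Decode the first four bytes, which RTP and RTCP lay out similarly."""
    first, second, field16 = struct.unpack("!BBH", packet[:4])
    info = {
        "version": first >> 6,                       # top two bits of byte 0
        "second_byte": second,
        "rtcp_type": RTCP_PACKET_TYPES.get(second),  # None if not an RTCP type
    }
    if info["rtcp_type"] is not None:
        # In RTCP, the 16-bit field is the packet length in 32-bit words minus one.
        info["rtcp_declared_bytes"] = (field16 + 1) * 4
    return info


# Header bytes copied from the "RTP header" line above
sample_header = bytes([129, 201, 0, 7, 0, 1, 87, 149, 132, 179, 156, 19])
print(describe_header(sample_header))
```

For the sample this reports version 2, type RR, and a declared length of 32 bytes. Adding a 16-byte Poly1305 tag gives 48, the size of the received packets, which would be consistent with these being encrypted receiver reports rather than audio.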
Upvotes: 0
Views: 172
Reputation: 109
I found the solution. When no one is in the voice channel, Discord seemingly sends a lot of rubbish data, followed by five packets of silence (0xF8, 0xFF, 0xFE), and then more rubbish data that cannot be decrypted. To receive bytes that can be decrypted with XSalsa20_Poly1305, someone has to be in the voice channel and must speak or make some sound. I initially didn't test while in the voice channel because I assumed silence would arrive in the same format as voice, so that if I could decrypt it, I would be good to go...
By the way, thanks to Topaco for pointing out some key issues with my implementation and for testing things as well!
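A receive loop can tolerate this behaviour by skipping packets that fail to decrypt and dropping the silence frames, instead of treating either as an error. The following is a minimal sketch: `collect_voice` and its `decrypt` parameter are hypothetical names, and `decrypt` stands for whatever callable turns one raw packet into an Opus frame (for example, a wrapper around `SecretBox.decrypt`).

```python
OPUS_SILENCE = b"\xf8\xff\xfe"  # the silence frame mentioned above


def collect_voice(packets, decrypt):
    """Return decrypted Opus frames, skipping undecryptable packets and silence."""
    frames = []
    for packet in packets:
        try:
            frame = decrypt(packet)
        except Exception:
            # Not a voice packet, or wrong key/nonce: skip it and keep listening.
            continue
        if frame == OPUS_SILENCE:
            continue  # silence carries no audio worth recording
        frames.append(frame)
    return frames
```

In real code it would be better to catch only the decryption error raised by the crypto library (`nacl.exceptions.CryptoError` in PyNaCl) so that unrelated bugs are not silently swallowed.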
Upvotes: 0