I am streaming audio data from two clients on a network into a common server application, which needs to take that audio data and combine it into a two-channel wave file. Both the clients and the server are software written by me.
I'm struggling with how to combine the streams on the server side, and a key requirement for the output wave file is recreating the timing with which the users talked. What I want is to put each client (there are only ever 2 per wave file) on its own channel of a 2-channel stereo wave file.
How do I handle a situation like this properly? Do the clients need to change to stream audio data differently? Also, what approach do you recommend for dealing with the pauses in the audio stream, i.e. capturing the delays between users pressing the push-to-talk button, when no messages are coming to the server?
Currently, the client software uses pyaudio to record from the default input device and sends individual frames over the network via TCP/IP, one message per frame. The clients work in a push-to-talk fashion and only send audio data while the push-to-talk button is held; otherwise no messages are sent.
I've done a decent bit of research into the WAVE file format, and I understand that to do this I will need to interleave the samples from each channel for every frame written, which is my main source of confusion. Because of the dynamic nature of this environment, and the synchronous way the server processes the audio data, most of the time I won't have data from both clients at once, and when I do, I don't have a good logical mechanism for telling the server to write both frames together.
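For what it's worth, my current understanding of the interleaving itself is roughly the following (a standalone sketch with made-up buffers, assuming numpy is available; this is not part of my server code):

import numpy as np

# Two equal-length chunks of raw 16-bit mono PCM, one per client (hypothetical buffers).
left_bytes = b"\x00\x00" * 4410   # 0.1 s worth of samples from client 1
right_bytes = b"\x00\x00" * 4410  # 0.1 s worth of samples from client 2

left = np.frombuffer(left_bytes, dtype=np.int16)
right = np.frombuffer(right_bytes, dtype=np.int16)

# A stereo frame is one left sample followed by one right sample,
# so the interleaved buffer alternates L, R, L, R, ...
stereo = np.empty(left.size + right.size, dtype=np.int16)
stereo[0::2] = left
stereo[1::2] = right
stereo_bytes = stereo.tobytes()  # what writeframes() would take for a 2-channel wave file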
Here is what I have so far for processing audio from clients. One instance of this class is created for each client and thus a separate wave file is created for every client, which isn't what I want.
import datetime
import os
import wave

import pyaudio


class AudioRepository(object):
    def __init__(self, root_directory, test_id, player_id):
        self.test_id = test_id
        self.player_id = player_id
        self.audio_filepath = os.path.join(
            root_directory, "{0}_{1}_voice_chat.wav".format(test_id, player_id))
        # One mono, 16-bit, 44.1 kHz wave file per client.
        self.audio_wave_writer = wave.open(self.audio_filepath, "wb")
        self.audio_wave_writer.setnchannels(1)
        self.audio_wave_writer.setframerate(44100)
        self.audio_wave_writer.setsampwidth(
            pyaudio.get_sample_size(pyaudio.paInt16))
        self.first_audio_record = True
        self.previous_audio_time = datetime.datetime.now()

    def write(self, record: Record):
        # Measure the wall-clock gap since the last record so it can be
        # represented as silence in the output file.
        now = datetime.datetime.now()
        time_passed_since_last = now - self.previous_audio_time
        number_blank_frames = int(44100 * time_passed_since_last.total_seconds())
        blank_data = b"\0\0" * number_blank_frames  # 2 bytes per 16-bit mono frame
        # Rate limiting hack: only write silence for gaps of a second or more,
        # to avoid inserting tiny pauses in the middle of a transmission.
        if not self.first_audio_record and time_passed_since_last.total_seconds() >= 1:
            self.audio_wave_writer.writeframes(blank_data)
        else:
            self.first_audio_record = False
        self.audio_wave_writer.writeframes(
            record.additional_data["audio_data"])
        self.previous_audio_time = datetime.datetime.now()

    def close(self):
        self.audio_wave_writer.close()
I typed this up by hand because the code is on a machine without internet access, so apologies if the formatting is messed up or there are typos.
This also demonstrates what I'm currently doing to handle the time between transmissions, which works moderately well. The rate limiting is a hack and does cause problems, but I think I have a real solution for that: the clients send messages when the user presses and releases the push-to-talk button, so I can use those as flags to pause the output of blank frames as long as the user is sending me real audio data (which was the real problem: while users were sending audio data, I was inserting a bunch of tiny pauses that made the audio choppy).
The expected solution is for the code above to no longer be tied to a single player id; instead, write will be called with records from both of the server's clients (still one record from each player individually, not combined) and will interleave the audio data from each into a 2-channel wave file, with each player on a separate channel. I'm just looking for suggestions on how to handle the details of this. My initial thought is that a thread and two queues of audio frames, one per client, will be involved, but I'm still iffy on how to combine it all into the wave file so it sounds right and is timed correctly. A rough sketch of what I have in mind is below.
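Something along these lines (just a sketch of the idea; the names and the 100 ms flush interval are made up, and I haven't tried this):

import queue
import threading
import wave

SAMPLE_WIDTH = 2   # bytes per 16-bit sample
FRAME_RATE = 44100

def interleave(left: bytes, right: bytes) -> bytes:
    """Interleave two equal-length mono int16 buffers into stereo frames."""
    out = bytearray()
    for i in range(0, len(left), SAMPLE_WIDTH):
        out += left[i:i + SAMPLE_WIDTH] + right[i:i + SAMPLE_WIDTH]
    return bytes(out)

def drain(q: "queue.Queue[bytes]") -> bytes:
    """Pull everything currently sitting in a queue without blocking."""
    chunks = []
    while True:
        try:
            chunks.append(q.get_nowait())
        except queue.Empty:
            return b"".join(chunks)

def mixer(path: str, q1: "queue.Queue[bytes]", q2: "queue.Queue[bytes]",
          stop: threading.Event) -> None:
    """Periodically drain both client queues and write aligned stereo frames."""
    writer = wave.open(path, "wb")
    writer.setnchannels(2)
    writer.setsampwidth(SAMPLE_WIDTH)
    writer.setframerate(FRAME_RATE)
    try:
        while not stop.is_set():
            left, right = drain(q1), drain(q2)
            # Pad whichever side is shorter with silence so the channels stay aligned.
            if len(left) < len(right):
                left += b"\x00" * (len(right) - len(left))
            elif len(right) < len(left):
                right += b"\x00" * (len(left) - len(right))
            if left:
                writer.writeframes(interleave(left, right))
            stop.wait(0.1)  # flush roughly every 100 ms
    finally:
        writer.close()

Note that this still wouldn't handle long gaps where neither client talks; that would need the wall-clock-based silence I'm already doing.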
I managed to solve this using pydub; I'm posting my solution here in case someone else stumbles upon this. I overcame the problem of keeping accurate timestamps with silence, as mentioned in the original post, by tracking the transmission start and end events the client software was already sending.
import datetime
import os

from pydub import AudioSegment


class AudioRepository(Repository):
    def __init__(self, test_id, board_sequence):
        Repository.__init__(self, test_id, board_sequence)
        self.audio_filepath = os.path.join(
            self.repository_directory, "{0}_voice_chat.wav".format(test_id))
        # Each player's audio accumulates in its own mono segment; the two are
        # only combined into a stereo segment when the repository is closed.
        self.player1_audio_segment = AudioSegment.empty()
        self.player2_audio_segment = AudioSegment.empty()
        self.player1_id = None
        self.player2_id = None
        self.player1_last_record_time = datetime.datetime.now()
        self.player2_last_record_time = datetime.datetime.now()

    def write_record(self, record: Record):
        player_id = record.additional_data["player_id"]
        if record.event_type == Record.VOICE_TRANSMISSION_START:
            # Insert silence covering the gap since this player's last
            # transmission ended, so the session timeline is preserved.
            if self.is_player1(player_id):
                time_elapsed = datetime.datetime.now() - self.player1_last_record_time
                segment = AudioSegment.silent(time_elapsed.total_seconds() * 1000,
                                              frame_rate=44100)
                self.player1_audio_segment += segment
            elif self.is_player2(player_id):
                time_elapsed = datetime.datetime.now() - self.player2_last_record_time
                segment = AudioSegment.silent(time_elapsed.total_seconds() * 1000,
                                              frame_rate=44100)
                self.player2_audio_segment += segment
        elif record.event_type == Record.VOICE_TRANSMISSION_END:
            if self.is_player1(player_id):
                self.player1_last_record_time = datetime.datetime.now()
            elif self.is_player2(player_id):
                self.player2_last_record_time = datetime.datetime.now()
        if record.event_type != Record.VOICE_MESSAGE_SENT:
            return
        # Append the raw 16-bit, 44.1 kHz mono frame data to this player's segment.
        frame_data = record.additional_data["audio_data"]
        segment = AudioSegment(data=frame_data, sample_width=2, frame_rate=44100, channels=1)
        if self.is_player1(player_id):
            self.player1_audio_segment += segment
        elif self.is_player2(player_id):
            self.player2_audio_segment += segment

    def close(self):
        Repository.close(self)
        # pydub's AudioSegment.from_mono_audiosegments expects all the segments
        # given to be of the same frame count. To ensure this, check each
        # segment's length and pad the shorter one with silence as necessary.
        player1_frames = self.player1_audio_segment.frame_count()
        player2_frames = self.player2_audio_segment.frame_count()
        frames_needed = abs(player1_frames - player2_frames)
        duration = frames_needed / 44100
        padding = AudioSegment.silent(duration * 1000, frame_rate=44100)
        if player1_frames > player2_frames:
            self.player2_audio_segment += padding
        elif player1_frames < player2_frames:
            self.player1_audio_segment += padding
        stereo_segment = AudioSegment.from_mono_audiosegments(
            self.player1_audio_segment, self.player2_audio_segment)
        stereo_segment.export(self.audio_filepath, format="wav")
This way I keep the two audio segments as independent segments throughout the session and combine them into one stereo segment that is then exported to the repository's wav file. pydub also made keeping track of the silent segments easier, because I still don't think I really understand how audio "frames" work or how to generate the right number of frames for a specific duration of silence. Nonetheless, pydub certainly does, and it takes care of that for me!
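For the record, the frame math pydub handles for me seems to boil down to roughly this (a sketch assuming the 44.1 kHz, 16-bit mono data my clients send; not code from my project):

FRAME_RATE = 44100   # frames per second
SAMPLE_WIDTH = 2     # bytes per sample (16-bit)
CHANNELS = 1         # mono, as sent by each client

def silence(seconds: float) -> bytes:
    """Raw PCM silence: one frame = one sample per channel."""
    n_frames = int(round(FRAME_RATE * seconds))
    return b"\x00" * (n_frames * SAMPLE_WIDTH * CHANNELS)

# e.g. 0.25 s of mono 16-bit silence is 11025 frames = 22050 bytes
assert len(silence(0.25)) == 22050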