Reputation: 115
In one of my projects, I need to resample PCM audio data to a different sample rate. I am using javax.sound.sampled.AudioSystem for this task. The resampling seems to add additional samples at the beginning and end of the frame. Here is a minimal working example:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
public class ResamplingTest {
    public static void main(final String[] args) throws IOException {
        final int nrOfSamples = 4;
        final int bytesPerSample = 2;
        final byte[] data = new byte[nrOfSamples * bytesPerSample];
        Arrays.fill(data, (byte) 10);
        final AudioFormat inputFormat = new AudioFormat(32000, bytesPerSample * 8, 1, true, false);
        final AudioInputStream inputStream = new AudioInputStream(new ByteArrayInputStream(data), inputFormat, data.length);
        final AudioFormat outputFormat = new AudioFormat(24000, bytesPerSample * 8, 1, true, false);
        final AudioInputStream outputStream = AudioSystem.getAudioInputStream(outputFormat, inputStream);
        final var resampledBytes = outputStream.readAllBytes();
        System.out.println("Expected number of samples after resampling "
                + (int) (nrOfSamples * outputFormat.getSampleRate() / inputFormat.getSampleRate()));
        System.out.println("Actual number of samples after resampling " + resampledBytes.length / bytesPerSample);
        System.out.println(Arrays.toString(resampledBytes));
    }
}
I would expect exactly 3 samples when resampling 4 samples from 32 kHz to 24 kHz. However, the above code generates 5 samples. The number of extra samples seems to depend on the input and output sample rate. For example, if I resample from 8 kHz to 32 kHz, 8 additional samples are generated. Why does resampling add additional samples, and how do I know how many samples are added at the beginning and end of a frame?
Upvotes: 3
Views: 1200
Reputation: 985
I bumped into the same issue recently and did some research. Here is what I found. The code responsible for re-sampling is located at:
In particular, it is the class AudioFloatInputStreamResampler and its methods read/readNextBuffer. The extra bytes added on re-sampling are indeed padding for the interpolation algorithm. It is worth noting that several interpolation algorithms are supported. You can choose one using the "interpolation" property of the target format, e.g.:
AudioFormat targetAudioFormat = new AudioFormat(
AudioFormat.Encoding.PCM_SIGNED,
16000, 16, 1, 2, 16000, false,
Map.of("interpolation", "linear"));
The list of supported interpolation algorithms is hard-coded and consists of: linear (same as linear2), linear1, linear2 (default), cubic, lanczos, sinc and point. The amount of padding depends on the algorithm chosen, and linear requires the fewest padding bytes of all the options. For example, the linear algorithm requires padding of 2 bytes, whereas point requires 100 bytes.
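To see how the output size changes with the algorithm on your own JDK, something like the following could be used. It is only a sketch: the class and method names are made up for illustration, the 8 kHz to 16 kHz formats are just an example, and I am assuming the converter picks up the "interpolation" property as described above. I deliberately don't state expected sizes, since they may differ between JDK versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper class, not part of the JDK.
final class InterpolationComparison {

    /** Resamples 8 kHz 16-bit mono PCM to 16 kHz and returns the output size in bytes. */
    static int resampledLength(byte[] pcm, String interpolation) throws IOException {
        AudioFormat source = new AudioFormat(8000, 16, 1, true, false);
        AudioFormat target = new AudioFormat(
                AudioFormat.Encoding.PCM_SIGNED,
                16000, 16, 1, 2, 16000, false,
                Map.of("interpolation", interpolation));
        // The stream length is given in frames (2 bytes per frame here).
        AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(pcm), source, pcm.length / source.getFrameSize());
        try (AudioInputStream out = AudioSystem.getAudioInputStream(target, in)) {
            return out.readAllBytes().length;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pcm = new byte[200]; // 100 silent frames
        for (String name : new String[] {"linear", "cubic", "point"}) {
            System.out.println(name + ": " + resampledLength(pcm, name) + " bytes");
        }
    }
}
```

Comparing the printed sizes against the nominal 2x output (400 bytes for this input) should show how much extra data each algorithm leaves in the result.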
I don't know whether leaving those padding bytes in the final output is a bug or not. To me, it would be fine to trim them, at least the zeroed ones.
In my case those extra bytes are especially troublesome because I need to re-sample streaming audio. Initially I implemented re-sampling by constructing a new audio stream per buffer. As a result, I had a trade-off between real-time processing and the frequency of the extra bytes (audible as clicks), depending on the buffer size used. So basically I see two ways to deal with it:
Run the conversion on a buffer of constant data and determine how the padding bytes are added. For example, I have to re-sample 8 kHz to 16 kHz and vice versa. I filled a buffer with uniform values (e.g. 120 for 8-bit samples) and ran the conversion. I found that on downsampling a single zero byte is added at the beginning of the buffer, and on upsampling there are 3 zero bytes and 1 byte interpolated toward zero (60) at the beginning. The last byte was also interpolated toward zero (60). Based on these results, I trim the extra bytes in my code.
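A minimal sketch of that trimming step, assuming you have already measured how many frames to drop at each end for your particular rates and algorithm. The class name and parameters are hypothetical, and the counts in the usage example are only the ones I observed for 8-bit upsampling from 8 kHz to 16 kHz; measure your own setup before relying on them.

```java
import java.util.Arrays;

// Hypothetical helper, not part of javax.sound.
final class PaddingTrimmer {

    /**
     * Returns a copy of the resampled buffer without the given number of
     * frames at the start and at the end.
     */
    static byte[] trim(byte[] resampled, int frameSize, int leadingFrames, int trailingFrames) {
        if (frameSize <= 0 || leadingFrames < 0 || trailingFrames < 0) {
            throw new IllegalArgumentException("frame size must be positive and counts non-negative");
        }
        int from = leadingFrames * frameSize;
        int to = resampled.length - trailingFrames * frameSize;
        if (to < from) {
            throw new IllegalArgumentException("buffer is shorter than the padding to remove");
        }
        return Arrays.copyOfRange(resampled, from, to);
    }

    public static void main(String[] args) {
        // Shape of the upsampled output described above: 3 zero bytes and
        // one ramp byte (60) in front, one ramp byte (60) at the end.
        byte[] resampled = {0, 0, 0, 60, 120, 120, 120, 60};
        System.out.println(Arrays.toString(trim(resampled, 1, 4, 1)));
    }
}
```

Working in frames rather than bytes keeps the helper usable for 16-bit or multi-channel data, where cutting at an odd byte offset would corrupt the samples.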
Wrap the overall incoming/outgoing streaming audio data in InputStream/AudioInputStream subclasses, so the padding bytes are added only once per stream. That is not so crucial for sound quality and avoids the trade-off with real-time processing.
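The second approach could look roughly like this. It is a sketch, not my production code: the class and method names are invented, and it simply wraps whatever InputStream delivers the raw PCM in one AudioInputStream of unspecified length, so the conversion is set up once for the whole stream instead of once per buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper, not part of javax.sound.
final class StreamingResampler {

    /** Creates a single converting stream over the whole raw PCM source. */
    static AudioInputStream resample(InputStream rawPcm, AudioFormat source, AudioFormat target) {
        // NOT_SPECIFIED: the total number of frames is unknown for a live stream.
        AudioInputStream in = new AudioInputStream(rawPcm, source, AudioSystem.NOT_SPECIFIED);
        return AudioSystem.getAudioInputStream(target, in);
    }

    public static void main(String[] args) throws IOException {
        AudioFormat source = new AudioFormat(8000, 16, 1, true, false);
        AudioFormat target = new AudioFormat(16000, 16, 1, true, false);
        // A byte array stands in for the real network or device stream here.
        InputStream rawPcm = new ByteArrayInputStream(new byte[1600]);
        try (AudioInputStream out = resample(rawPcm, source, target)) {
            byte[] buffer = new byte[512];
            int total = 0;
            int n;
            while ((n = out.read(buffer)) != -1) {
                total += n; // hand each converted chunk to the consumer here
            }
            System.out.println("converted bytes: " + total);
        }
    }
}
```

For real-time input, rawPcm would be something that blocks until more data arrives (for instance a PipedInputStream fed by the capture thread), with the read loop running on its own thread.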
Upvotes: 0
Reputation: 7910
I was playing around with this. I don't really have an answer, just a couple thoughts. I suspect the streams are "padded" with beginning or ending zeroes for algorithmic purposes.
First off, this doesn't seem to make a difference, but the length argument in your AudioInputStream instantiation should be the number of frames, not the number of bytes.
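For reference, here is a small sketch of how the frame count can be derived from the byte array (the class and method names are mine, just for illustration):

```java
import java.io.ByteArrayInputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;

// Hypothetical helper, not part of javax.sound.
final class FrameLengthExample {

    /** Wraps raw PCM bytes, passing the stream length in frames rather than bytes. */
    static AudioInputStream wrap(byte[] data, AudioFormat format) {
        // For PCM, the frame size is bytes per sample times the number of channels.
        long frames = data.length / format.getFrameSize();
        return new AudioInputStream(new ByteArrayInputStream(data), format, frames);
    }

    public static void main(String[] args) {
        AudioFormat format = new AudioFormat(32000, 16, 1, true, false);
        AudioInputStream stream = wrap(new byte[8], format);
        System.out.println(stream.getFrameLength()); // prints 4
    }
}
```

With 16-bit mono data, 8 bytes correspond to 4 frames, which is what the stream then reports.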
I ran your program with just 1 byte per sample as it seems to make things clearer, with a value of 10 in each frame.
Original number of samples: 4
Expected number of samples after resampling 3
Actual number of samples after resampling 5
original data: [10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 6]
Original number of samples: 5
Expected number of samples after resampling 3
Actual number of samples after resampling 6
original data: [10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 3]
Original number of samples: 6
Expected number of samples after resampling 4
Actual number of samples after resampling 7
original data: [10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 0]
Original number of samples: 7
Expected number of samples after resampling 5
Actual number of samples after resampling 7
original data: [10, 10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 10]
Original number of samples: 8
Expected number of samples after resampling 6
Actual number of samples after resampling 8
original data: [10, 10, 10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 10, 6]
Original number of samples: 9
Expected number of samples after resampling 6
Actual number of samples after resampling 9
original data: [10, 10, 10, 10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 10, 10, 3]
Original number of samples: 10
Expected number of samples after resampling 7
Actual number of samples after resampling 10
original data: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 10, 10, 10, 0]
Original number of samples: 11
Expected number of samples after resampling 8
Actual number of samples after resampling 10
original data: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]
resampled data: [0, 3, 10, 10, 10, 10, 10, 10, 10, 10]
Maybe the algorithm is treating the input line as if there is a preceding 0 value and an ending 0 value. The latter seems more clearly in evidence.
Look at the ends of the outputs for 7, 8, and 9 input samples. In the first case, I'm assuming the two sample rates "line up", in that the last point on the input line is also a point on the output, not an "intermediate value". When the last point on the output line falls beyond the input signal, it looks like linear interpolation is used between the last input value and 0.
I'm not clear what is going on at the start, but it seems like the algorithm may also be coming up with a linear interpolation between 0 and the first input value. I don't understand why it isn't a 0.6 instead of a 0.3, or why there is a leading zero.
For the most part, though, notice that we do have the predicted number of 10's! The exception is when the leading and ending partial values add up to 10 (less rounding; I'm assuming 3 should be 3.3 and 6 should be 6.7 if extended by a decimal place. Try putting in 100 instead of 10 and you will see), which happens in the runs with 4 and 8 input samples.
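To illustrate the guess about the ramp values, here is the arithmetic as a tiny sketch. This is my assumption about what is happening, not the JDK's actual code: going from 32 kHz to 24 kHz, each output sample advances 4/3 of an input sample, so an output point past the end of the input would sit 1/3 or 2/3 of the way between the last input value and a padded 0.

```java
// Hypothetical illustration of linear interpolation toward a padded zero.
final class RampSketch {

    /** Linear interpolation between a and b at fraction t in [0, 1]. */
    static double lerp(double a, double b, double t) {
        return a + (b - a) * t;
    }

    public static void main(String[] args) {
        for (double value : new double[] {10, 100}) {
            // Output point 1/3 of the way from the last input sample to the padded 0.
            System.out.println(value + " at 1/3: " + lerp(value, 0, 1.0 / 3.0));
            // Output point 2/3 of the way from the last input sample to the padded 0.
            System.out.println(value + " at 2/3: " + lerp(value, 0, 2.0 / 3.0));
        }
    }
}
```

For an input value of 10 this gives roughly 6.67 and 3.33, which would show up as 6 and 3 once truncated to integer samples, in line with the trailing values in the output above.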
I am also going to assume the transform algorithm was designed for a use case with thousands of samples, in which case one or two additional leading/ending values are not going to affect the sound meaningfully, especially given that they ramp between the source signal and 0.
Upvotes: 1