Reputation: 41
With AudioSet released and providing a brand new area of research for those who do sound analysis for research, I've been really trying to dig deep these last few days on how to analyze and decode such data.
The data is given in .tfrecord files; here's a small snippet:
�^E^@^@^@^@^@^@C�bd
u
^[
^Hvideo_id^R^O
^KZZcwENgmOL0
^^
^Rstart_time_seconds^R^H^R^F
^D^@^@�C
^X
^Flabels^R^N^Z^L
�^B�^B�^B�^B�^B
^\
^Pend_time_seconds^R^H^R^F
^D^@^@�C^R�
�
^Oaudio_embedding^R�
�^A
�^A
�^A3�^] q^@�Z�r�����w���Q����.���^@�b�{m�^@P^@^S����,^]�x�����:^@����^@^@^Z0��^@]^Gr?v(^@^U^@��^EZ6�$
�^A
The example proto given is:
context: {
  feature: {
    key  : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key  : "start_time_seconds"
    value: {
      float_list: {
        value: 6.0
      }
    }
  }
  feature: {
    key  : "end_time_seconds"
    value: {
      float_list: {
        value: 16.0
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172]  # The meaning of the labels can be found here.
      }
    }
  }
}
feature_lists: {
  feature_list: {
    key  : "audio_embedding"
    value: {
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
    }
    ... # Repeated for every second of the segment
  }
}
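For reference, the structure above maps onto the tf.train protos roughly like this (an untested sketch; the values and file name are made up):
import tensorflow as tf

# Hypothetical values standing in for a real segment.
video_id = b'ZZcwENgmOL0'
labels = [1, 522, 11, 172]
embeddings = [bytes(128) for _ in range(10)]  # one 128-byte vector per second

example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'video_id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[video_id])),
        'start_time_seconds': tf.train.Feature(float_list=tf.train.FloatList(value=[6.0])),
        'end_time_seconds': tf.train.Feature(float_list=tf.train.FloatList(value=[16.0])),
        'labels': tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'audio_embedding': tf.train.FeatureList(feature=[
            tf.train.Feature(bytes_list=tf.train.BytesList(value=[emb]))
            for emb in embeddings
        ]),
    }))

# Write the record back out as a .tfrecord file.
with tf.python_io.TFRecordWriter('my_segment.tfrecord') as writer:
    writer.write(example.SerializeToString())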
My very direct question - something I can't seem to find good information on - is: how do I convert cleanly between the two?
If I have a machine-readable file, how do I make it human readable, and the other way around?
I have found this, which takes a tfrecord of an image and converts it to a readable format... but I can't seem to get it into a form that works with AudioSet.
Upvotes: 4
Views: 1748
Reputation: 146
This is what I have done so far. The prepare_serialized_examples function is adapted from the YouTube-8M starter code. I hope that helps :)
import tensorflow as tf

def prepare_serialized_examples(serialized_example,
                                max_quantized_value=2, min_quantized_value=-2):
    # Parse the SequenceExample: the context holds video_id and labels,
    # the sequence holds one 128-byte embedding string per second.
    contexts, features = tf.parse_single_sequence_example(
        serialized_example,
        context_features={'video_id': tf.FixedLenFeature([], tf.string),
                          'labels': tf.VarLenFeature(tf.int64)},
        sequence_features={'audio_embedding': tf.FixedLenSequenceFeature([], dtype=tf.string)})

    # Decode the raw bytes into a [num_seconds, 128] float tensor.
    decoded_features = tf.reshape(
        tf.cast(tf.decode_raw(features['audio_embedding'], tf.uint8), tf.float32),
        [-1, 128])

    return contexts, decoded_features

filename = '/audioset_v1_embeddings/bal_train/2a.tfrecord'
filename_queue = tf.train.string_input_producer([filename], num_epochs=1)
reader = tf.TFRecordReader()

with tf.Session() as sess:
    _, serialized_example = reader.read(filename_queue)
    context, features = prepare_serialized_examples(serialized_example)

    # num_epochs creates a local variable, so initialise both global and local variables.
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess.run(init_op)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    print(sess.run(features))

    coord.request_stop()
    coord.join(threads)
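Note that the max_quantized_value / min_quantized_value arguments are accepted but never applied above. If you want the embeddings mapped back from uint8 to the original float range, something along the lines of the Dequantize helper in the YouTube-8M starter code should work (the exact constants here are my assumption from that code):
def dequantize(feat_vector, max_quantized_value=2.0, min_quantized_value=-2.0):
    # Map uint8-quantized embeddings back to the original float range.
    quantized_range = max_quantized_value - min_quantized_value
    scalar = quantized_range / 255.0
    bias = (quantized_range / 512.0) + min_quantized_value
    return feat_vector * scalar + bias

# e.g. at the end of prepare_serialized_examples, after the reshape:
# decoded_features = dequantize(decoded_features, max_quantized_value, min_quantized_value)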
Upvotes: 2
Reputation: 409
This worked for me, storing the features in feat_audio. To plot them, convert them to an ndarray and reshape accordingly.
import tensorflow as tf

audio_record = '/audioset_v1_embeddings/eval/_1.tfrecord'
vid_ids = []
labels = []
start_time_seconds = []  # in seconds
end_time_seconds = []
feat_audio = []
count = 0

for example in tf.python_io.tf_record_iterator(audio_record):
    # The context features (video_id, labels, start/end times) can be read
    # by parsing the record as a plain Example.
    tf_example = tf.train.Example.FromString(example)
    #print(tf_example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)

    # The per-second embeddings live in the feature_lists of the SequenceExample.
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)

    sess = tf.InteractiveSession()
    audio_frame = []
    # Iterate through frames: each frame is a 128-byte string that decodes
    # to a 128-dimensional uint8 embedding.
    for i in range(n_frames):
        audio_frame.append(tf.cast(tf.decode_raw(
            tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0], tf.uint8),
            tf.float32).eval())
    sess.close()

    feat_audio.append([])
    feat_audio[count].append(audio_frame)
    count += 1
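To get one segment into an ndarray and plot it as mentioned above, something like this should do (assumes numpy and matplotlib are available; feat_audio[0][0] follows the nesting used above):
import numpy as np
import matplotlib.pyplot as plt

# feat_audio[0][0] is the list of per-second embedding vectors for the first record.
segment = np.asarray(feat_audio[0][0]).reshape(-1, 128)  # shape: (seconds, 128)

plt.imshow(segment.T, aspect='auto', origin='lower')
plt.xlabel('time (s)')
plt.ylabel('embedding dimension')
plt.colorbar()
plt.show()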
Upvotes: 4
Reputation: 1
The YouTube-8M starter code should work with the AudioSet tfrecord files out of the box.
Upvotes: 0
Reputation: 8210
The AudioSet data is not a tensorflow.Example protobuf like the image example you linked; it's a SequenceExample.
I haven't tested it, but you should be able to use the code you linked if you replace tf.parse_single_example with tf.parse_single_sequence_example (and replace the field names).
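Roughly like this (an untested sketch against the TF 1.x API; serialized_example is one serialized record, read e.g. via TFRecordReader as in the other answers):
import tensorflow as tf

# serialized_example: one record read from the .tfrecord file.
context, sequence = tf.parse_single_sequence_example(
    serialized_example,
    context_features={
        'video_id': tf.FixedLenFeature([], tf.string),
        'start_time_seconds': tf.FixedLenFeature([], tf.float32),
        'end_time_seconds': tf.FixedLenFeature([], tf.float32),
        'labels': tf.VarLenFeature(tf.int64),
    },
    sequence_features={
        'audio_embedding': tf.FixedLenSequenceFeature([], tf.string),
    })

# Each sequence step is a 128-byte string; decode it into a [num_seconds, 128] tensor.
embeddings = tf.reshape(
    tf.decode_raw(sequence['audio_embedding'], tf.uint8), [-1, 128])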
Upvotes: 1