Zach

Reputation: 41

AudioSet and Tensorflow Understanding

With AudioSet released, opening a brand new area of research for those who do sound analysis, I've spent the last few days digging into how to analyze and decode this data.

The data is given in .tfrecord files; here's a small snippet.

�^E^@^@^@^@^@^@C�bd
u
^[
^Hvideo_id^R^O

^KZZcwENgmOL0
^^
^Rstart_time_seconds^R^H^R^F
^D^@^@�C
^X
^Flabels^R^N^Z^L

�^B�^B�^B�^B�^B
^\
^Pend_time_seconds^R^H^R^F
^D^@^@�C^R�

�

^Oaudio_embedding^R�

�^A
�^A
�^A3�^] q^@�Z�r�����w���Q����.���^@�b�{m�^@P^@^S����,^]�x�����:^@����^@^@^Z0��^@]^Gr?v(^@^U^@��^EZ6�$
�^A

The example proto given is:

context: {
  feature: {
    key  : "video_id"
    value: {
      bytes_list: {
        value: [YouTube video id string]
      }
    }
  }
  feature: {
    key  : "start_time_seconds"
    value: {
      float_list: {
        value: 6.0
      }
    }
  }
  feature: {
    key  : "end_time_seconds"
    value: {
      float_list: {
        value: 16.0
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172] # The meaning of the labels can be found here.
      }
    }
  }
}
feature_lists: {
  feature_list: {
    key  : "audio_embedding"
    value: {
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      feature: {
        bytes_list: {
          value: [128 8bit quantized features]
        }
      }
      ... # Repeated for every second of the segment
    }
  }
}

My very direct question here, something I can't seem to find good information on, is: how do I convert cleanly between the two?

Given a machine-readable file, how do I make it human-readable, and the other way around?

I have found this, which takes a tfrecord of a picture and converts it to a readable format, but I can't seem to get it into a form that works with AudioSet.

Upvotes: 4

Views: 1748

Answers (4)

BitWhyz

Reputation: 146

This is what I have done so far. The prepare_serialized_examples function is adapted from the youtube-8m starter code. I hope that helps :)

import tensorflow as tf

def prepare_serialized_examples(serialized_example, max_quantized_value=2, min_quantized_value=-2):
    # Parse the SequenceExample: video_id and labels are context features,
    # while the per-second audio embeddings are a sequence feature of raw bytes.
    contexts, features = tf.parse_single_sequence_example(
        serialized_example,
        context_features={"video_id": tf.FixedLenFeature([], tf.string),
                          "labels": tf.VarLenFeature(tf.int64)},
        sequence_features={'audio_embedding': tf.FixedLenSequenceFeature([], dtype=tf.string)})

    # Decode each frame's 128 raw bytes into a length-128 float vector.
    decoded_features = tf.reshape(
        tf.cast(tf.decode_raw(features['audio_embedding'], tf.uint8), tf.float32),
        [-1, 128])

    return contexts, decoded_features


filename = '/audioset_v1_embeddings/bal_train/2a.tfrecord'
filename_queue = tf.train.string_input_producer([filename], num_epochs=1)

reader = tf.TFRecordReader()

with tf.Session() as sess:

    _, serialized_example = reader.read(filename_queue)
    contexts, features = prepare_serialized_examples(serialized_example)

    # string_input_producer with num_epochs creates a local variable,
    # so local variables have to be initialized as well.
    init_op = tf.group(tf.global_variables_initializer(),
                       tf.local_variables_initializer())
    sess.run(init_op)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    print(sess.run(features))

    coord.request_stop()
    coord.join(threads)
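
Note that the max_quantized_value and min_quantized_value arguments above are never actually used. If you also want to map the decoded uint8 values back to the original embedding range, the YouTube-8M starter code does it roughly like this (a sketch, assuming the standard [-2, 2] quantization range):

def dequantize(feat_vector, max_quantized_value=2, min_quantized_value=-2):
    # Map 8-bit quantized features back to floats in roughly [min, max].
    quantized_range = max_quantized_value - min_quantized_value
    scalar = quantized_range / 255.0
    bias = (quantized_range / 512.0) + min_quantized_value
    return feat_vector * scalar + bias

For example, dequantize(decoded_features) after calling prepare_serialized_examples.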

Upvotes: 2

jerpint

Reputation: 409

This worked for me, storing the features in feat_audio. To plot them, convert them to an ndarray and reshape them accordingly.

import tensorflow as tf

audio_record = '/audioset_v1_embeddings/eval/_1.tfrecord'
vid_ids = []
labels = []
start_time_seconds = []  # in seconds
end_time_seconds = []
feat_audio = []

sess = tf.InteractiveSession()
for example in tf.python_io.tf_record_iterator(audio_record):
    # The context features (video_id, labels, start/end times) parse as a plain Example.
    tf_example = tf.train.Example.FromString(example)
    #print(tf_example)
    vid_ids.append(tf_example.features.feature['video_id'].bytes_list.value[0].decode(encoding='UTF-8'))
    labels.append(tf_example.features.feature['labels'].int64_list.value)
    start_time_seconds.append(tf_example.features.feature['start_time_seconds'].float_list.value)
    end_time_seconds.append(tf_example.features.feature['end_time_seconds'].float_list.value)

    # The audio embeddings live in the feature_lists of a SequenceExample.
    tf_seq_example = tf.train.SequenceExample.FromString(example)
    n_frames = len(tf_seq_example.feature_lists.feature_list['audio_embedding'].feature)

    # Decode each one-second frame from raw bytes to a length-128 float vector.
    audio_frame = []
    for i in range(n_frames):
        audio_frame.append(tf.cast(tf.decode_raw(
            tf_seq_example.feature_lists.feature_list['audio_embedding'].feature[i].bytes_list.value[0],
            tf.uint8), tf.float32).eval())

    feat_audio.append(audio_frame)
sess.close()
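
To plot the first segment, for example, you could do something along these lines (a sketch; assumes numpy and matplotlib are installed):

import numpy as np
import matplotlib.pyplot as plt

arr = np.asarray(feat_audio[0])   # shape (n_frames, 128)
plt.imshow(arr.T, aspect='auto')  # one column per second, 128 dims per column
plt.xlabel('frame (seconds)')
plt.ylabel('embedding dimension')
plt.show()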

Upvotes: 4

Jort Gemmeke

Reputation: 1

The YouTube-8M starter code should work with the AudioSet tfrecord files out of the box.

Upvotes: 0

Mark McDonald

Reputation: 8210

The AudioSet data is not a tensorflow.Example protobuf, like the image example you have linked. It's a SequenceExample.

I've not tested it, but you should be able to use the code you linked if you replace tf.parse_single_example with tf.parse_single_sequence_example (and replace the field names).
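
For concreteness, a rough, untested sketch of that substitution (feature names and shapes taken from the proto in the question; the file path is just an example):

import tensorflow as tf

# Read one serialized record straight from the file.
record = next(tf.python_io.tf_record_iterator(
    '/audioset_v1_embeddings/bal_train/2a.tfrecord'))

# Parse it as a SequenceExample instead of a plain Example.
contexts, sequences = tf.parse_single_sequence_example(
    record,
    context_features={
        "video_id": tf.FixedLenFeature([], tf.string),
        "labels": tf.VarLenFeature(tf.int64),
    },
    sequence_features={
        "audio_embedding": tf.FixedLenSequenceFeature([], tf.string),
    })

with tf.Session() as sess:
    ctx, seq = sess.run([contexts, sequences])
    # video id, plus an (n_frames,) array of raw 128-byte strings
    print(ctx['video_id'], seq['audio_embedding'].shape)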

Upvotes: 1
