Reputation: 43
I am learning machine learning from Andrew Ng's lectures on Coursera. The course uses MATLAB, which is great for understanding and prototyping machine learning models, but it is rather slow. I am currently researching TensorFlow, since it supports GPU utilization and data pipelining, which should speed up my models.
However, I am completely lost on this one. The documentation does not go into detail, the sample code is not commented, and to top it all off, TensorFlow just released a 2.0 alpha that changes the API significantly (so many old StackOverflow threads don't help).
My goals are:
- read my training data from CSV files into TFRecords,
- build a fast input pipeline from those TFRecords, and
- train a Keras model on the GPU.
Right now, I've only been able to build the Keras model below:
from tensorflow import keras

lambd = 0.001  # l2 regularization factor used for the Dense layer below
# x_train, y_train, x_test, y_test are loaded elsewhere in my script

model = keras.Sequential([
    keras.layers.Conv2D(filters=3, activation='relu',
                        kernel_regularizer=keras.regularizers.l2(0.001),
                        kernel_size=28,
                        padding="same",
                        input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.09),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dropout(0.09)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=47, batch_size=256)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('\nTest accuracy:', test_acc)
Anything would be helpful at this point! What functions should I look into that would be crucial for any of my goals?
Upvotes: 2
Views: 2138
Reputation: 43
After 24 hours of nonstop research, I finally put all the pieces of the puzzle together. The API is amazing, but the documentation is lacking.
For converting a CSV file to TFRecord:
import tensorflow as tf
import numpy as np
import pandas as pd            # For reading the .csv
from datetime import datetime  # For timing how long each read/write takes

def _bytes_feature(value):
    # Returns a bytes_list from a string / byte.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    # Returns a float_list from a float / double.
    # If a list of values is passed, a float list feature with the entire list is returned.
    if isinstance(value, list):
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    # Returns an int64_list from a bool / enum / int / uint.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(pandabase):
    # Serializes inputs from a pandas dataset (read in chunks).
    # Creates a mapping of the features from the header row of the file.
    base_chunk = pandabase.get_chunk(0)
    num_features = len(base_chunk.columns)
    features_map = {}
    for i in range(num_features):
        features_map.update({'feature' + str(i): _float_feature(0)})

    # Set writing options with compression
    # (in TF 2.x the compression type is passed as a string).
    options = tf.io.TFRecordOptions(compression_type='ZLIB', compression_level=9)
    with tf.io.TFRecordWriter('test2.tfrecord.zip', options=options) as writer:
        # Convert each chunk to a numpy array and write every row to the file.
        for chunk in pandabase:
            nump = chunk.to_numpy()
            for row in nump:
                ii = 0
                for elem in row:
                    features_map['feature' + str(ii)] = _float_feature(float(elem))
                    ii += 1
                myProto = tf.train.Example(features=tf.train.Features(feature=features_map))
                writer.write(myProto.SerializeToString())

start = datetime.now()
bk1 = pd.read_csv("Book2.csv", chunksize=2048, engine='c', iterator=True)
serialize_example(bk1)
end = datetime.now()
print("- consumed time: %ds" % (end - start).seconds)
For machine learning from the tfrecords and using the GPU: follow this guide for the correct setup, then use this code:
import tensorflow as tf
from tensorflow import keras

lambd = 0.001  # l2 regularization factor

# Recreate the feature mapping (must match the one used to write the tfrecords)
_NUMCOL = 5
feature_description = {}
for i in range(_NUMCOL):
    feature_description.update({'feature' + str(i): tf.io.FixedLenFeature([], tf.float32)})

# Parse the tfrecords into the form (x, y) or (x, y, weights) to be used with keras
def _parse_function(example_proto):
    dic = tf.io.parse_single_example(example_proto, feature_description)
    y = dic['feature0']
    x = tf.stack([dic['feature1'],
                  dic['feature2'],
                  dic['feature3'],
                  dic['feature4']], axis=0)
    return x, y

# Let tensorflow autotune the pipeline
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Create a tf.data dataset from the recorded file; set parallel reads to the number
# of cores for the best reading speed
myData = tf.data.TFRecordDataset('test.tfrecord.zip', compression_type='ZLIB',
                                 num_parallel_reads=2)

# Map the data to a form usable by keras (using _parse_function), cache the data,
# shuffle, and read the data in batches
myData = myData.map(_parse_function, num_parallel_calls=AUTOTUNE)
myData = myData.cache()
myData = myData.shuffle(buffer_size=8192)
batch_size = 16385
myData = myData.batch(batch_size).prefetch(buffer_size=AUTOTUNE)

model = keras.Sequential([
    keras.layers.Dense(100, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(1, activation='linear', kernel_regularizer=keras.regularizers.l2(lambd))])
model.compile(optimizer='adam',
              loss='mean_squared_error')
model.fit(myData, epochs=5)  # train straight from the tf.data pipeline (epoch count is just a placeholder)
model.save('keras.HD5F')
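Two quick checks that helped me (they are not part of the snippet above, so adjust the names to your own setup): confirming that TensorFlow actually sees the GPU, and reloading the saved model to make sure the saved file is usable:
import tensorflow as tf
from tensorflow import keras

# List the GPUs TensorFlow can see; an empty list means training falls back to the CPU
print(tf.config.experimental.list_physical_devices('GPU'))

# Reload the saved model and run it on one batch from the pipeline as a smoke test
reloaded = keras.models.load_model('keras.HD5F')
for x_batch, y_batch in myData.take(1):
    print(reloaded.predict(x_batch)[:5])  # first few predictions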
Upvotes: 2