Reputation: 43
I am learning machine learning from Andrew Ng's lectures on Coursera. The course uses MATLAB, which is great for understanding and prototyping machine learning models, but it is rather slow. I am currently researching TensorFlow, since it supports GPU utilization and data pipelining, which should speed up my models.
However, I am completely lost on this one. The documentation does not go into detail, the sample code is not commented, and to top it all off, TensorFlow just released a 2.0 alpha that changes the API significantly (so many old StackOverflow threads don't help).
My goals are:
- read my training data from CSV files into TFRecords,
- build a fast input pipeline from those TFRecords, and
- train a Keras model on the GPU.
Right now, I've only been able to build the Keras model below:
from tensorflow import keras

lambd = 0.001  # l2 regularization factor used for the Dense layer below
# x_train, y_train, x_test, y_test are loaded elsewhere in my script

model = keras.Sequential([
    keras.layers.Conv2D(filters=3, activation='relu',
                        kernel_regularizer=keras.regularizers.l2(0.001),
                        kernel_size=28,
                        padding="same",
                        input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.09),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dropout(0.09)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=47, batch_size=256)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('\nTest accuracy:', test_acc)
Anything would be helpful at this point! What functions should I look into that would be crucial for any of my goals?
Upvotes: 2
Views: 2138
Reputation: 43
After 24 hours of nonstop research, I finally put all the pieces of the puzzle together. The API is amazing, but the documentation is lacking.
For converting a CSV file to TFRecord:
import tensorflow as tf
import numpy as np
import pandas as pd            # For reading the .csv
from datetime import datetime  # For timing how long each read/write takes

def _bytes_feature(value):
    # Returns a bytes_list from a string / byte.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    # Returns a float_list from a float / double.
    # If a list of values is passed, a float list feature with the entire list is returned.
    if isinstance(value, list):
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    # Returns an int64_list from a bool / enum / int / uint.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def serialize_example(pandabase):
    # Serializes inputs from a pandas dataset (read in chunks).
    # Creates a mapping of the features from the header row of the file.
    base_chunk = pandabase.get_chunk(0)
    num_features = len(base_chunk.columns)
    features_map = {}
    for i in range(num_features):
        features_map.update({'feature' + str(i): _float_feature(0)})

    # Set writing options with compression
    # (in TF 2.x the compression type is passed as a string).
    options = tf.io.TFRecordOptions(compression_type='ZLIB', compression_level=9)
    with tf.io.TFRecordWriter('test2.tfrecord.zip', options=options) as writer:
        # Convert each chunk to a numpy array and write every row to the file.
        for chunk in pandabase:
            nump = chunk.to_numpy()
            for row in nump:
                ii = 0
                for elem in row:
                    features_map['feature' + str(ii)] = _float_feature(float(elem))
                    ii += 1
                myProto = tf.train.Example(features=tf.train.Features(feature=features_map))
                writer.write(myProto.SerializeToString())

start = datetime.now()
bk1 = pd.read_csv("Book2.csv", chunksize=2048, engine='c', iterator=True)
serialize_example(bk1)
end = datetime.now()
print("- consumed time: %ds" % (end - start).seconds)
For machine learning from the tfrecords and using the GPU: follow this guide for the correct setup, then use this code:
import tensorflow as tf
from tensorflow import keras

lambd = 0.001  # l2 regularization factor

# Recreate the feature mapping (must match the one used to write the tfrecords)
_NUMCOL = 5
feature_description = {}
for i in range(_NUMCOL):
    feature_description.update({'feature' + str(i): tf.io.FixedLenFeature([], tf.float32)})

# Parse the tfrecords into the form (x, y) or (x, y, weights) to be used with keras
def _parse_function(example_proto):
    dic = tf.io.parse_single_example(example_proto, feature_description)
    y = dic['feature0']
    x = tf.stack([dic['feature1'],
                  dic['feature2'],
                  dic['feature3'],
                  dic['feature4']], axis=0)
    return x, y

# Let tensorflow autotune the pipeline
AUTOTUNE = tf.data.experimental.AUTOTUNE

# Create a tf.data dataset from the recorded file; set parallel reads to the number
# of cores for the best reading speed
myData = tf.data.TFRecordDataset('test.tfrecord.zip', compression_type='ZLIB',
                                 num_parallel_reads=2)

# Map the data to a form usable by keras (using _parse_function), cache the data,
# shuffle, and read the data in batches
myData = myData.map(_parse_function, num_parallel_calls=AUTOTUNE)
myData = myData.cache()
myData = myData.shuffle(buffer_size=8192)
batch_size = 16385
myData = myData.batch(batch_size).prefetch(buffer_size=AUTOTUNE)

model = keras.Sequential([
    keras.layers.Dense(100, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(1, activation='linear', kernel_regularizer=keras.regularizers.l2(lambd))])
model.compile(optimizer='adam',
              loss='mean_squared_error')
model.fit(myData, epochs=5)  # train straight from the tf.data pipeline (epoch count is just a placeholder)
model.save('keras.HD5F')
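Two quick checks that helped me (they are not part of the snippet above, so adjust the names to your own setup): confirming that TensorFlow actually sees the GPU, and reloading the saved model to make sure the saved file is usable:
import tensorflow as tf
from tensorflow import keras

# List the GPUs TensorFlow can see; an empty list means training falls back to the CPU
print(tf.config.experimental.list_physical_devices('GPU'))

# Reload the saved model and run it on one batch from the pipeline as a smoke test
reloaded = keras.models.load_model('keras.HD5F')
for x_batch, y_batch in myData.take(1):
    print(reloaded.predict(x_batch)[:5])  # first few predictions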
Upvotes: 2