Reputation: 4402
I'm trying to do Stanford's CS20: TensorFlow for Deep Learning Research course. The first 2 lectures provide a good introduction to the low-level plumbing and computation framework (that frankly the official introductory tutorials seem to skip right over for reasons I can only fathom as sadism). In lecture 3, it starts performing a linear regression and makes what seems like a fairly heavy cognitive leap for me. Instead of calling sess.run on a tensor computation, it does it on the GradientDescentOptimizer:
sess.run(optimizer, feed_dict={X: x, Y:y})
The full code is available on page 3 of the lecture 3 notes.
EDIT: code and data are also available at this GitHub repository: code is available in examples/ and data in examples/data/birth_life_2010.txt
EDIT: code is below as per request
import tensorflow as tf
import utils
DATA_FILE = "data/birth_life_2010.txt"
# Step 1: read in data from the .txt file
# data is a numpy array of shape (190, 2), each row is a datapoint
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
w = tf.get_variable('weights', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: construct model to predict Y (life expectancy from birth rate)
Y_predicted = w * X + b
# Step 5: use the square error as the loss function
loss = tf.square(Y - Y_predicted, name='loss')
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())
    # Step 8: train the model
    for i in range(100): # run 100 epochs
        for x, y in data:
            # Session runs train_op to minimize loss
            sess.run(optimizer, feed_dict={X: x, Y:y})
    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
I've done the Coursera machine learning course, so I (think I) understand the notion of Gradient Descent. But I'm quite lost as to what is happening in this specific case.
What I would expect to have to happen:
I understand that in practice you'd apply things like batching and subsets but in this case I believe this is just looping over the entire dataset 100 times.
I can (and have) implemented this before. But I'm struggling to fathom how the code above could be achieving this. For one thing, the optimizer is called on each data point (i.e. it's in an inner loop of the 100 epochs and then each data point). I would have expected an optimization call which took in the entire dataset.
Question 1 - is the gradient adjustment operating over the entire data set 100 times, or over the entire data set 100 times in batches of 1 (so 100*n times, for n examples)?
Question 2 - how does the optimizer 'know' how to adjust w and b? It's only provided the loss tensor - is it reading back through the graph and just going "well, w and b are the only variables, so I'll wiggle the hell out of those"?
Question 2b - if so, what happens if you put in other variables? Or more complex functions? Does it just auto-magically calculate gradient adjustment for every variable in the predecessor graph?
Question 2c - pursuant to that I've tried adjusting to a quadratic expression as suggested in page 3 of the tutorial but end up getting a higher loss. Is this normal? The tutorial seems to suggest it should be better. At the least I would expect it not to be worse - is this subject to changing hyperparameters?
EDIT: Full code for my attempts to adjust to quadratic is here. Note that this is the same as the above with lines 28, 29, 30 and 34 modified to use a quadratic predictor. These edits are what I interpret to be what's suggested in the lecture 3 notes on page 4.
""" Solution for simple linear regression example using placeholders
Created by Chip Huyen ([email protected])
CS20: "TensorFlow for Deep Learning Research"
Lecture 03
"""
import os
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import utils
DATA_FILE = 'data/birth_life_2010.txt'
# Step 1: read in data from the .txt file
data, n_samples = utils.read_birth_life_data(DATA_FILE)
# Step 2: create placeholders for X (birth rate) and Y (life expectancy)
X = tf.placeholder(tf.float32, name='X')
Y = tf.placeholder(tf.float32, name='Y')
# Step 3: create weight and bias, initialized to 0
# w = tf.get_variable('weights', initializer=tf.constant(0.0)) old single weight
w = tf.get_variable('weights_1', initializer=tf.constant(0.0))
u = tf.get_variable('weights_2', initializer=tf.constant(0.0))
b = tf.get_variable('bias', initializer=tf.constant(0.0))
# Step 4: build model to predict Y
#Y_predicted = w * X + b #linear
Y_predicted = w * X * X + X * u + b #quadratic
#Y_predicted = w # test of nonsense
# Step 5: use the squared error as the loss function
# you can use either mean squared error or Huber loss
loss = tf.square(Y - Y_predicted, name='loss')
#loss = utils.huber_loss(Y, Y_predicted)
# Step 6: using gradient descent with learning rate of 0.001 to minimize loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
start = time.time()
writer = tf.summary.FileWriter('./graphs/linear_reg', tf.get_default_graph())
with tf.Session() as sess:
    # Step 7: initialize the necessary variables, in this case, w and b
    sess.run(tf.global_variables_initializer())
    # Step 8: train the model for 100 epochs
    for i in range(100):
        total_loss = 0
        for x, y in data:
            # Session execute optimizer and fetch values of loss
            _, l = sess.run([optimizer, loss], feed_dict={X: x, Y:y})
            total_loss += l
        print('Epoch {0}: {1}'.format(i, total_loss/n_samples))
    # close the writer when you're done using it
    writer.close()
    # Step 9: output the values of w and b
    w_out, b_out = sess.run([w, b])
print('Took: %f seconds' %(time.time() - start))
print(f'w = {w_out}')
# plot the results
plt.plot(data[:,0], data[:,1], 'bo', label='Real data')
plt.plot(data[:,0], data[:,0] * w_out + b_out, 'r', label='Predicted data')
For the linear predictor I get loss of (this aligns with lecture notes):
Epoch 99: 30.03552558278714
For my attempts at the quadratic I get loss of:
Epoch 99: 127.2992221294363
Upvotes: 7
Views: 6935
Reputation: 10474
is a single input). I.e. compute the gradient of the loss with respect to a single example, update the parameters, go to the next example... until you have gone over the whole dataset. Do this 100 times.
The key is the minimize call of the optimizer. Indeed, you only put in the cost: under the hood, TensorFlow will then compute gradients for all requested variables (we'll get to that in a second) that are involved in the cost computation (it can infer this from the computational graph) and return an op that "applies" the gradients. This means an op that takes all the requested variables and assigns a new value to them, something like tf.assign(var, var - learning_rate*gradient). This is related to another question you asked: minimize returns just an op; this doesn't do anything by itself! Running this op in a session will do a "gradient step" each time.
As to which variables are actually affected by this op: you can give this as an argument to the minimize call! See here -- the argument is var_list. If this is not given, TensorFlow will simply use all "trainable variables". By default, any variable you create with tf.Variable or tf.get_variable is trainable. However, you can pass trainable=False to these functions to create variables that are not (by default) going to be affected by the op returned by minimize. Play around with this! See what happens if you set some variables not to be trainable, or if you pass a custom var_list to minimize.
In general, the "whole idea" of Tensorflow is that it can "magically" calculate gradients based on only a feedforward description of the model.
EDIT: This is possible because machine learning models (including deep learning) are composed of quite simple building blocks such as matrix multiplications and mostly pointwise nonlinearities. These simple blocks also have simple derivatives, which can be composed via the chain rule. You might want to read up on the backpropagation algorithm.
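As a rough illustration of composing simple derivatives via the chain rule, the sketch below does by hand, in plain Python, the forward and backward pass for the squared loss of the linear model. The function name loss_and_grads and the input values are made up for this example; TensorFlow builds the equivalent gradient computation automatically from the graph.

```python
def loss_and_grads(w, b, x, y):
    # Forward pass, one simple building block at a time.
    prod = w * x        # multiplication
    pred = prod + b     # addition
    diff = y - pred     # subtraction
    loss = diff * diff  # square

    # Backward pass: multiply local derivatives along the path (chain rule).
    d_diff = 2.0 * diff     # d loss / d diff
    d_pred = -1.0 * d_diff  # d diff / d pred = -1
    d_prod = 1.0 * d_pred   # d pred / d prod = 1
    d_b = 1.0 * d_pred      # d pred / d b = 1
    d_w = x * d_prod        # d prod / d w = x
    return loss, d_w, d_b


loss, d_w, d_b = loss_and_grads(w=0.5, b=1.0, x=2.0, y=4.0)
print(loss, d_w, d_b)  # 4.0 -8.0 -4.0

# Sanity check against a central finite difference in w.
eps = 1e-6
numeric_d_w = (loss_and_grads(0.5 + eps, 1.0, 2.0, 4.0)[0]
               - loss_and_grads(0.5 - eps, 1.0, 2.0, 4.0)[0]) / (2 * eps)
print(abs(numeric_d_w - d_w) < 1e-4)  # True
```

Each line of the backward pass only needs the derivative of one simple operation, which is why the same recipe extends to more variables and more complex functions.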
It will certainly take longer with very large models. But it is always possible as long as there is a clear "path" through the computation graph where all components have defined derivatives.
As to whether this can generate poor models: Yes, and this is a fundamental problem of deep learning. Very complex/deep models lead to highly non-convex cost functions which are difficult to optimize with methods like gradient descent.
With regards to the quadratic function: Looks like there are two problems here.
Upvotes: 9