Reputation: 3359
Here is the code that I am trying to run:
import tensorflow as tf
import numpy as np
import input_data
filename_queue = tf.train.string_input_producer(["cs-training.csv"])
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
record_defaults = [[1], [1], [1], [1], [1], [1], [1], [1], [1], [1], [1]]
col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11 = tf.decode_csv(
value, record_defaults=record_defaults)
features = tf.concat(0, [col2, col3, col4, col5, col6, col7, col8, col9, col10, col11])
with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    for i in range(1200):
        # Retrieve a single instance:
        print i
        example, label = sess.run([features, col1])
        try:
            print example, label
        except:
            pass

    coord.request_stop()
    coord.join(threads)
This code returns the error below.
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-23-e42fe2609a15> in <module>()
7 # Retrieve a single instance:
8 print i
----> 9 example, label = sess.run([features, col1])
10 try:
11 print example, label
/root/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in run(self, fetches, feed_dict)
343
344 # Run request and get response.
--> 345 results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
346
347 # User may have fetched the same tensor multiple times, but we
/root/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.pyc in _do_run(self, target_list, fetch_list, feed_dict)
417 # pylint: disable=protected-access
418 raise errors._make_specific_exception(node_def, op, e.error_message,
--> 419 e.code)
420 # pylint: enable=protected-access
421 raise e_type, e_value, e_traceback
InvalidArgumentError: Field 1 in record 0 is not a valid int32: 0.766126609
The traceback is followed by a lot of information that I think is irrelevant to the problem. Obviously the problem is that much of the data I am feeding to the program is not of dtype int32; it's mostly floating-point numbers. I've tried a few things to change the dtype, like explicitly setting the dtype=float argument in tf.decode_csv as well as in tf.concat. Neither worked; I still get an invalid-argument error. On top of that, I don't know whether this code will actually make a prediction on the data. I want it to predict whether col1 is going to be a 1 or a 0, and I don't see anything in the code that hints it will actually make that prediction. Maybe I'll save that question for a different thread. Any help is greatly appreciated!
Upvotes: 10
Views: 8546
Reputation: 126184
The interface to tf.decode_csv() is a little tricky. The dtype of each column is determined by the corresponding element of the record_defaults argument. The value for record_defaults in your code is interpreted as giving each column tf.int32 as its type, which leads to an error when the parser encounters floating-point data.
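To see why the error message names that particular value, the same parse failure can be reproduced in plain Python, with no TensorFlow involved:

```python
# The failing field from the error message: a decimal string is a
# valid float but not a valid integer, which is why an int32 column
# default makes the CSV parser reject the record.
value = "0.766126609"

print(float(value))        # parses fine as a float

try:
    int(value)             # fails, just as the int32 decode did
except ValueError as e:
    print("ValueError:", e)
```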
Let's say you have the following CSV data, containing three integer columns, followed by a floating point column:
4, 8, 9, 4.5
2, 5, 1, 3.7
2, 2, 2, 0.1
Assuming all of the columns are required, you would build record_defaults as follows:
value = ...
record_defaults = [tf.constant([], dtype=tf.int32), # Column 0
tf.constant([], dtype=tf.int32), # Column 1
tf.constant([], dtype=tf.int32), # Column 2
tf.constant([], dtype=tf.float32)] # Column 3
col0, col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults)
assert col0.dtype == tf.int32
assert col1.dtype == tf.int32
assert col2.dtype == tf.int32
assert col3.dtype == tf.float32
An empty value in record_defaults signifies that the value is required. Alternatively, if (e.g.) column 2 is allowed to have missing values, you would define record_defaults as follows:
record_defaults = [tf.constant([], dtype=tf.int32), # Column 0
tf.constant([], dtype=tf.int32), # Column 1
tf.constant([0], dtype=tf.int32), # Column 2
tf.constant([], dtype=tf.float32)] # Column 3
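As an illustration only, the per-column rule can be sketched in plain Python. parse_field below is a made-up helper that mimics, but is not, TensorFlow's parser:

```python
def parse_field(text, dtype, default=None):
    """Simplified sketch of tf.decode_csv's per-column rule: the dtype
    comes from the column's default, and a column whose default is
    empty is required."""
    if text.strip() == "":
        if default is None:
            raise ValueError("required field is missing")
        return default
    return dtype(text)

# Column 2 may be missing (default 0); column 3 is a required float.
print(parse_field("", int, default=0))   # uses the default
print(parse_field("4.5", float))         # parses the text

try:
    parse_field("", float)               # required field -> error
except ValueError as e:
    print("error:", e)
```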
The second part of your question concerns how to build and train a model that predicts the value of one of the columns from the input data. Currently, the program doesn't do this: it simply concatenates the columns into a single tensor called features. You will need to define and train a model that interprets that data. One of the simplest approaches is linear regression, and you might find this tutorial on linear regression in TensorFlow adaptable to your problem.
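As a purely illustrative sketch of what that missing modeling step looks like, here is linear regression fitted by gradient descent in plain NumPy on synthetic data (the data and hyperparameters are made up; for a real solution, follow the TensorFlow tutorial mentioned above):

```python
import numpy as np

# Toy stand-in for the missing modeling step: fit y = w*x + b by
# gradient descent on synthetic data with known ground truth.
rng = np.random.RandomState(0)
x = rng.rand(100).astype(np.float32)
y = 3.0 * x + 2.0  # ground truth: w = 3, b = 2

w, b = 0.0, 0.0
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)  # d(mean squared error)/dw
    grad_b = 2 * np.mean(pred - y)        # d(mean squared error)/db
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

print(round(w, 2), round(b, 2))  # close to 3.0 and 2.0
```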
Upvotes: 20
Reputation: 3359
The answer to changing the dtype is to just change the defaults, like so:
record_defaults = [[1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.], [1.]]
After you do that, if you print out col1, you'll receive this message.
Tensor("DecodeCSV_43:0", shape=TensorShape([]), dtype=float32)
But there is another error that you will run into, which has been answered here. To recap that answer, the workaround is to change tf.concat to tf.pack, like so:
features = tf.pack([col2, col3, col4, col5, col6, col7, col8, col9, col10, col11])
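The reason tf.concat fails here is that each decoded column is a scalar (shape []), and concatenation joins along an existing axis, while tf.pack stacks its inputs along a new one. NumPy, used here only as an analogy, shows the same distinction:

```python
import numpy as np

# Each decoded CSV column is a scalar (shape []). Concatenation joins
# along an existing axis, so zero-dimensional values cannot be
# concatenated -- but stacking creates a new axis, which is what
# tf.pack does for the scalar columns.
cols = [np.float32(0.5), np.float32(1.5), np.float32(2.5)]

stacked = np.stack(cols)
print(stacked.shape)  # (3,)

try:
    np.concatenate(cols)
except ValueError as e:
    print("concatenate failed:", e)
```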
Upvotes: 1