khemedi

Reputation: 806

How to use tf.data in tensorflow to read .csv files?

I have three different .csv datasets that I typically read with pandas and use to train deep learning models. Each dataset is an n-by-m matrix, where n is the number of samples and m is the number of features. After reading the data, I do some reshaping and then feed it to my deep learning model using feed_dict:

import numpy as np
import pandas as pd
import tensorflow as tf

data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])

data = pd.concat([data1, data2, data3], axis=1)

# Some deep learning model that works with data
# An optimizer

with tf.compat.v1.Session() as sess:
    sess.run(init)
    sess.run(optimizer, feed_dict={SOME_VARIABLE: data})

However, my data is now too big to fit in memory, and I am wondering how I can use tf.data to read it instead of pandas. Apologies that the script above is pseudo-code rather than my actual code.

Upvotes: 5

Views: 8620

Answers (1)

Nikhil

Reputation: 1236

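This applies to TF 2.0 and above. There are a few ways to create a Dataset from CSV files:

  1. Read the CSV files with pandas, as I believe you are doing now, and wrap the DataFrame. This still requires the whole DataFrame to fit in memory:

    tf.data.Dataset.from_tensor_slices(dict(pandaDF))

  2. Let TensorFlow read the files and return batches of column-keyed features:

    tf.data.experimental.make_csv_dataset

  3. Read the files line by line (for example with tf.data.TextLineDataset) and parse each line yourself with:

    tf.io.decode_csv

  4. Stream typed records straight from the files with:

    tf.data.experimental.CsvDataset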
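
As a rough illustration of options 3 and 4, here is a minimal sketch that streams rows from disk instead of loading them with pandas. The temporary files, the column names A, B, C, and the helper name parse_line are assumptions made only so the example runs on its own; your real files and column types will differ.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf

# Assumption: three small CSV files with a header row and float columns A, B, C,
# written to a temporary directory only so that this sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
frames, csv_paths = [], []
for i in range(3):
    frame = pd.DataFrame(np.random.uniform(size=(10, 3)), columns=["A", "B", "C"])
    path = os.path.join(tmp_dir, f"data{i + 1}.csv")
    frame.to_csv(path, index=False)
    frames.append(frame)
    csv_paths.append(path)

# Option 4: CsvDataset streams records from the files; header=True skips the
# first line of every file, and record_defaults gives one dtype per column.
csv_ds = tf.data.experimental.CsvDataset(
    csv_paths,
    record_defaults=[tf.float32, tf.float32, tf.float32],
    header=True,
)
csv_ds = csv_ds.map(lambda a, b, c: tf.stack([a, b, c]))  # one (3,) vector per row


# Option 3: read raw text lines and parse each one with tf.io.decode_csv.
def parse_line(line):
    # parse_line is a hypothetical helper name; the defaults make every column float32.
    fields = tf.io.decode_csv(line, record_defaults=[[0.0], [0.0], [0.0]])
    return tf.stack(fields)


line_ds = (
    tf.data.Dataset.from_tensor_slices(csv_paths)
    .flat_map(lambda p: tf.data.TextLineDataset(p).skip(1))  # skip each file's header
    .map(parse_line)
)

# Materialize both datasets only to compare them; real training would iterate lazily.
csv_rows = np.stack([row.numpy() for row in csv_ds])
line_rows = np.stack([row.numpy() for row in line_ds])
print(csv_rows.shape, line_rows.shape)
```

Both pipelines should yield the same rows, read file after file. In practice you would add .batch(...) and .prefetch(...) before passing either dataset to training.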

Details are here: Load CSV
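
For option 2, a minimal sketch might look like the following. The temporary files and the column names A, B, C are assumptions for illustration only. Note that make_csv_dataset shuffles and repeats indefinitely by default, so the sketch sets shuffle=False and num_epochs=1 to get a single ordered pass.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf

# Assumption: three small CSV files with a header row, created here only so the
# sketch runs on its own. In practice these would be your existing files.
tmp_dir = tempfile.mkdtemp()
csv_paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"data{i + 1}.csv")
    frame = pd.DataFrame(np.random.uniform(size=(10, 3)), columns=["A", "B", "C"])
    frame.to_csv(path, index=False)
    csv_paths.append(path)

# Each element is a dict mapping column name -> tensor of shape (batch,).
dataset = tf.data.experimental.make_csv_dataset(
    csv_paths,
    batch_size=4,
    num_epochs=1,   # one pass over the data instead of repeating forever
    shuffle=False,  # keep file order for this illustration
)

first_batch = next(iter(dataset))
column_names = sorted(first_batch.keys())

# Count rows by iterating batch by batch; nothing is loaded all at once.
total_rows = 0
for batch in dataset:
    total_rows += int(batch["A"].shape[0])
print(column_names, total_rows)
```

If one of the columns is the target, passing label_name makes each element a (features, label) pair, which is the shape Keras model.fit expects.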

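If you need to do some processing with pandas before loading, you can keep your current approach, but instead of calling pd.concat([data1, data2, data3], axis=1), use the Dataset's concatenate method. Note that concatenate appends the elements of one dataset after the other (row-wise), whereas pd.concat with axis=1 joins columns.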

import numpy as np
import pandas as pd
import tensorflow as tf

data1 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])
data2 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])
data3 = pd.DataFrame(np.random.uniform(low=0, high=1, size=(10,3)), columns=['A', 'B', 'C'])

# Build one Dataset per DataFrame, then append them one after another.
tf_dataset = tf.data.Dataset.from_tensor_slices(dict(data1))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data2)))
tf_dataset = tf_dataset.concatenate(tf.data.Dataset.from_tensor_slices(dict(data3)))

More about concatenate
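
Since the question joins the three frames side by side with axis=1, it may help to compare Dataset.concatenate with Dataset.zip. The sketch below is only an illustration under the question's toy setup (three 10-by-3 frames); the names row_wise and col_wise are mine, and the frames are still held in memory here.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Assumption: three toy frames shaped like the ones in the question.
frames = [
    pd.DataFrame(np.random.uniform(size=(10, 3)), columns=["A", "B", "C"])
    for _ in range(3)
]

# One dataset per frame; each element is a float32 vector of 3 features.
ds1, ds2, ds3 = [
    tf.data.Dataset.from_tensor_slices(f.to_numpy(dtype="float32")) for f in frames
]

# concatenate appends elements: 10 + 10 + 10 samples, still 3 features each.
row_wise = ds1.concatenate(ds2).concatenate(ds3)

# zip pairs elements up: 10 samples, and joining each triple gives 9 features,
# which mirrors pd.concat(..., axis=1).
col_wise = tf.data.Dataset.zip((ds1, ds2, ds3)).map(
    lambda a, b, c: tf.concat([a, b, c], axis=0)
)

n_rowwise = sum(1 for _ in row_wise)
n_colwise = sum(1 for _ in col_wise)
first_wide = next(iter(col_wise)).numpy()
print(n_rowwise, n_colwise, first_wide.shape)
```

The same zip-then-map pattern should also work when ds1, ds2 and ds3 are file-backed datasets, as long as the files list the samples in the same order.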

Upvotes: 5
