unicornication

Reputation: 627

How to load a 3D array with mixed data types into Tensorflow for training?

I am working with a dataset that looks something like this-

PERSON1 = [["Person1Id", "Rome", "Frequent Flyer", "1/2/2018"],["Person1Id", "London", "Frequent Flyer", "3/4/2018"],["Person1Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
PERSON2 = [["Person2Id", "Shenzen", "Frequent Flyer", "1/2/2018"],["Person2Id", "London", "Frequent Flyer", "2/6/2018"],["Person2Id", "Hong Kong", "Not Frequent Flyer", "1/3/2017"], ...]
PERSON3 = [["Person3Id", "Moscow", "Frequent Flyer", "1/2/2018"],["Person3Id", "London", "Frequent Flyer", "3/4/2018"],["Person3Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
...

TRAIN_X = [ 
    PERSON1, PERSON2, PERSON3, ..., PERSONN
]

TRAIN_Y = [
    1, 0, 1, ..., 1
]

The idea is that some persons are of class 1 and some of class 0, depending on the training data. (The actual arrays are longer; this is a simplified version.)

My question is - given this structure of data - how might I correctly load it into Tensorflow to train a neural network system? I've worked with simpler datasets like the Iris dataset, MNIST, etc. I have no idea how to deal with more complex, real-world data like this, and I can't seem to find any documentation / resources / sample code that does anything similar.

I assume the first step here is that the data needs to be flattened, normalized, etc. in some way; however, I'm not sure how to proceed.

Upvotes: 0

Views: 516

Answers (2)

ted

Reputation: 14684

You seem to have categorical data. And let's assume that's how you want it to be.

You can either pre-process it in pure Python, if that makes sense given your amount of data (for instance, run your pre-processing once and save the preprocessed data instead of re-processing every time). This would mean having something like:

import numpy as np

def one_hot(index, max_dim):
    return np.eye(max_dim)[index]

destinations = {
    "Moscow": 0,
    "London": 1,
    # etc.
}
one_hot_destinations = {
    k: one_hot(v, len(destinations)) 
    for k, v in destinations.items()
}
def process_loc(loc):
    return one_hot_destinations[loc]

# do some similar processing to other properties of a "PERSON"
# so that you represent them in a vector / scalar way then:

def process_person(person_item):
    pid, loc, status, date = person_item
    return np.concatenate(
        [
            process_pid(pid), 
            process_loc(loc), 
            process_status(status), 
            process_date(date)
        ],
        axis=0)

TRAIN_X = [[process_person(item) for item in p] for p in PERSONS]
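The helpers `process_pid`, `process_status`, and `process_date` are left to the reader above. As one possible sketch of two of them (the `"%d/%m/%Y"` date format, the reference date, and the binary status encoding are assumptions based on the question's sample data, not part of the original answer):

```python
from datetime import datetime

def process_status(status):
    # Binary feature for the two statuses seen in the sample data (assumed)
    return [1.0] if status == "Frequent Flyer" else [0.0]

REFERENCE_DATE = datetime(2017, 1, 1)  # arbitrary fixed anchor

def process_date(date_str, fmt="%d/%m/%Y"):
    # Days elapsed since the reference date, as one scalar feature;
    # the day/month order in the question's dates is ambiguous, so
    # the format string here is an assumption
    delta = datetime.strptime(date_str, fmt) - REFERENCE_DATE
    return [float(delta.days)]
```

Both helpers return lists so they drop straight into the `np.concatenate` call in `process_person`.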

Or you can process it on the fly with TensorFlow's C++ ops. For that you can do a table lookup, which works much like looking into a dictionary:

table = tf.contrib.lookup.index_table_from_file(vocabulary_file, num_oov_buckets=0)
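To make the semantics of that lookup concrete, here is a rough pure-Python equivalent of what such a table does (a sketch for illustration only; `build_index_table` and its hashing of out-of-vocabulary tokens into extra buckets are hypothetical, mirroring `num_oov_buckets`):

```python
def build_index_table(vocabulary, num_oov_buckets=1):
    # Known tokens map to their position in the vocabulary;
    # unknown tokens hash into one of num_oov_buckets extra ids,
    # mimicking what index_table_from_file does with OOV tokens
    index = {tok: i for i, tok in enumerate(vocabulary)}

    def lookup(token):
        if token in index:
            return index[token]
        return len(vocabulary) + (hash(token) % num_oov_buckets)

    return lookup

lookup = build_index_table(["Rome", "London", "Paris"], num_oov_buckets=1)
lookup("London")   # known token -> its vocabulary index
lookup("Shenzen")  # unknown token -> an OOV bucket id
```

With `num_oov_buckets=0`, unknown tokens would instead need a default id or an error, which is why a non-zero bucket count is often safer for real-world data.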

You have to understand that, given the little context and code you have provided, it is hard to help you further than Matěj Račinský and I have tried to. We have no clue what your downstream task is, for instance.

If you want to do something NLP-related, maybe you can have a look at the blog post I wrote a few months back, Multi-label Text Classification with Tensorflow, which includes data pre-processing with table_lookup.

Upvotes: 0

Matěj Račinský

Reputation: 1804

You need to do some heavier preprocessing for that data. Neural networks can't work with text data directly, so you need to do some embedding.

Depending on the type of each feature, you will probably want to encode the data as numbers via one-hot or label encoding, or transform it to geographical coordinates, if that makes sense for the task.

You'll probably use one-hot encoding for city names, since they are categorical data, but you'll want to transform ordinal data, like dates, to numbers. Also think about which data are useful for the task, e.g. whether the problem you want to solve with the NN makes use of the person ID or not.

Also, you will probably have tensors of different shapes after the input processing, so it might be better to split the input into multiple variables (e.g. if you had some features encoded as one-hot, and some not).

Also remember that you'll need to normalize inputs to the network, so choose the representation accordingly.
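As a quick illustration of that normalization step, here is a column-wise z-score sketch with NumPy (the helper name and the sample feature matrix are made up for this example):

```python
import numpy as np

def normalize(features, eps=1e-8):
    # Z-score each column: subtract the column mean, divide by the
    # column standard deviation (eps guards against constant columns)
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# e.g. rows of [days-since-reference, frequent-flyer-flag] features
X = np.array([[396.0, 1.0],
              [430.0, 0.0],
              [ 62.0, 1.0]])
X_norm = normalize(X)  # each column now has ~zero mean, ~unit variance
```

Whatever scheme you pick, compute the statistics on the training set only and reuse them for validation and test data.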

I'm afraid there is no plug'n'play solution for that.

Upvotes: 2
