unicornication

Reputation: 627

How to load a 3D array with mixed data types into Tensorflow for training?

I am working with a dataset that looks something like this-

PERSON1 = [["Person1Id", "Rome", "Frequent Flyer", "1/2/2018"],["Person1Id", "London", "Frequent Flyer", "3/4/2018"],["Person1Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
PERSON2 = [["Person2Id", "Shenzen", "Frequent Flyer", "1/2/2018"],["Person2Id", "London", "Frequent Flyer", "2/6/2018"],["Person2Id", "Hong Kong", "Not Frequent Flyer", "1/3/2017"], ...]
PERSON3 = [["Person3Id", "Moscow", "Frequent Flyer", "1/2/2018"],["Person3Id", "London", "Frequent Flyer", "3/4/2018"],["Person3Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
...

TRAIN_X = [ 
    PERSON1, PERSON2, PERSON3, ..., PERSONN
]

TRAIN_Y = [
    1, 0, 1, ..., 1
]

The idea is that some persons are of class 1 and some of class 0, depending on the training data. (The actual arrays are longer; this is a simplified version.)

My question is - given this structure of data - how might I correctly load it into Tensorflow to train a neural network system? I've worked with simpler datasets like the Iris dataset, MNIST, etc. I have no idea how to deal with more complex, real-world data like this, and I can't seem to find any documentation / resources / sample code that does anything similar.

I assume the first step here is that the data needs to be flattened, normalized, etc. in some way; however, I'm not sure how to proceed.

Upvotes: 0

Views: 516

Answers (2)

ted

Reputation: 14684

You seem to have categorical data. And let's assume that's how you want it to be.

You can either pre-process it in pure Python, if that makes sense given your amount of data (for instance, run your pre-processing once and save the preprocessed data instead of re-processing every time). This would mean having something like:

import numpy as np

def one_hot(index, max_dim):
    return np.eye(max_dim)[index]

destinations = {
    "Moscow": 0,
    "London": 1,
    # etc.
}
one_hot_destinations = {
    k: one_hot(v, len(destinations)) 
    for k, v in destinations.items()
}
def process_loc(loc):
    return one_hot_destinations[loc]

# do some similar processing to other properties of a "PERSON"
# so that you represent them in a vector / scalar way then:

def process_person(person_item):
    pid, loc, status, date = person_item
    return np.concatenate(
        [
            process_pid(pid), 
            process_loc(loc), 
            process_status(status), 
            process_date(date)
        ],
        axis=0)

TRAIN_X = [[process_person(item) for item in p] for p in PERSONS]
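The helpers `process_pid`, `process_status`, and `process_date` are left to the reader above. As one possible sketch of two of them (the `"%d/%m/%Y"` date format, the reference date, and the binary status encoding are assumptions based on the question's sample data, not part of the original answer):

```python
from datetime import datetime

def process_status(status):
    # Binary feature for the two statuses seen in the sample data (assumed)
    return [1.0] if status == "Frequent Flyer" else [0.0]

REFERENCE_DATE = datetime(2017, 1, 1)  # arbitrary fixed anchor

def process_date(date_str, fmt="%d/%m/%Y"):
    # Days elapsed since the reference date, as one scalar feature;
    # the day/month order in the question's dates is ambiguous, so
    # the format string here is an assumption
    delta = datetime.strptime(date_str, fmt) - REFERENCE_DATE
    return [float(delta.days)]
```

Both helpers return lists so they drop straight into the `np.concatenate` call in `process_person`.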

Or you can process it on the fly with TensorFlow's C++ ops. For that you can do a table lookup, which works much like looking into a dictionary:

table = tf.contrib.lookup.index_table_from_file(vocabulary_file, num_oov_buckets=0)
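To make the semantics of that lookup concrete, here is a rough pure-Python equivalent of what such a table does (a sketch for illustration only; `build_index_table` and its hashing of out-of-vocabulary tokens into extra buckets are hypothetical, mirroring `num_oov_buckets`):

```python
def build_index_table(vocabulary, num_oov_buckets=1):
    # Known tokens map to their position in the vocabulary;
    # unknown tokens hash into one of num_oov_buckets extra ids,
    # mimicking what index_table_from_file does with OOV tokens
    index = {tok: i for i, tok in enumerate(vocabulary)}

    def lookup(token):
        if token in index:
            return index[token]
        return len(vocabulary) + (hash(token) % num_oov_buckets)

    return lookup

lookup = build_index_table(["Rome", "London", "Paris"], num_oov_buckets=1)
lookup("London")   # known token -> its vocabulary index
lookup("Shenzen")  # unknown token -> an OOV bucket id
```

With `num_oov_buckets=0`, unknown tokens would instead need a default id or an error, which is why a non-zero bucket count is often safer for real-world data.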

You have to understand that, given the little context and code you have provided, it is hard to help you further than Matěj Račinský and I have tried to. We have no clue what your downstream task is, for instance.

If you want to do something NLP-related, maybe you can have a look at the blog post I wrote a few months back, Multi-label Text Classification with Tensorflow, which includes data pre-processing with table_lookup.

Upvotes: 0

Matěj Račinský

Reputation: 1804

You need to do some heavier preprocessing for that data. Neural networks can't work with text data directly, so you need to do some embedding.

Depending on the type of each feature, you will probably want to encode the data as numbers via one-hot or label encoding, or transform it to geographical coordinates, if that makes sense for the task.

You'll probably use one-hot encoding for city names, since they are categorical data, but you'll want to transform ordinal data, like dates, to numbers. Also think about which data are useful for the task, e.g. whether the problem you want to solve with the NN makes use of the person ID or not.

Also, you will probably have tensors of different shapes after the input processing, so it might be better to split the input into multiple variables (e.g. if you had some features encoded as one-hot, and some not).

Also remember that you'll need to normalize inputs to the network, so choose the representation accordingly.
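As a quick illustration of that normalization step, here is a column-wise z-score sketch with NumPy (the helper name and the sample feature matrix are made up for this example):

```python
import numpy as np

def normalize(features, eps=1e-8):
    # Z-score each column: subtract the column mean, divide by the
    # column standard deviation (eps guards against constant columns)
    features = np.asarray(features, dtype=np.float64)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# e.g. rows of [days-since-reference, frequent-flyer-flag] features
X = np.array([[396.0, 1.0],
              [430.0, 0.0],
              [ 62.0, 1.0]])
X_norm = normalize(X)  # each column now has ~zero mean, ~unit variance
```

Whatever scheme you pick, compute the statistics on the training set only and reuse them for validation and test data.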

I'm afraid there is no plug'n'play solution for that.

Upvotes: 2
