Reputation: 627
I am working with a dataset that looks something like this:
PERSON1 = [["Person1Id", "Rome", "Frequent Flyer", "1/2/2018"],["Person1Id", "London", "Frequent Flyer", "3/4/2018"],["Person1Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
PERSON2 = [["Person2Id", "Shenzen", "Frequent Flyer", "1/2/2018"],["Person2Id", "London", "Frequent Flyer", "2/6/2018"],["Person2Id", "Hong Kong", "Not Frequent Flyer", "1/3/2017"], ...]
PERSON3 = [["Person3Id", "Moscow", "Frequent Flyer", "1/2/2018"],["Person3Id", "London", "Frequent Flyer", "3/4/2018"],["Person3Id", "Paris", "Frequent Flyer", "2/4/2018"], ...]
...
TRAIN_X = [
PERSON1, PERSON2, PERSON3, ..., PERSONN
]
TRAIN_Y = [
1, 0, 1, ..., 1
]
The idea being that some persons are of class 1 and some of class 0, depending on the training data. (The actual data arrays are longer; this is a simplified version.)
My question is: given this structure of data, how might I correctly load it into TensorFlow to train a neural network? I've worked with simpler datasets like the Iris dataset, MNIST, etc. I have no idea how to deal with more complex, real-world data like this, and I can't seem to find any documentation / resources / sample code that does anything similar.
I assume the first step here is that the data needs to be flattened, normalized, etc. in some way, but I'm not sure how to proceed.
Upvotes: 0
Views: 516
Reputation: 14684
You seem to have categorical data. And let's assume that's how you want it to be.
You can either pre-process it in pure Python, if that makes sense given your amount of data (for instance, run your pre-processing once and save the preprocessed data instead of re-processing every time). This would mean having something like:
import numpy as np


def one_hot(index, max_dim):
    # Row `index` of the identity matrix is the one-hot vector for that index
    return np.eye(max_dim)[index]


destinations = {
    "Moscow": 0,
    "London": 1,
    # etc.
}
one_hot_destinations = {
    k: one_hot(v, len(destinations))
    for k, v in destinations.items()
}


def process_loc(loc):
    return one_hot_destinations[loc]


# do some similar processing to other properties of a "PERSON"
# (process_pid, process_status and process_date are left for you to define)
# so that you represent them in a vector / scalar way then:
def process_person(person_item):
    pid, loc, status, date = person_item
    return np.concatenate(
        [
            process_pid(pid),
            process_loc(loc),
            process_status(status),
            process_date(date)
        ],
        axis=0)


# PERSONS is the raw list [PERSON1, PERSON2, ...] from the question
TRAIN_X = [[process_person(item) for item in p] for p in PERSONS]
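The helpers process_status and process_date are used in the snippet above but never defined. A minimal sketch of what they might look like, using only the standard library, is below. The status labels come from the question; the day/month/year date format and the reference date are assumptions, since the question does not say how "1/2/2018" should be read.

```python
from datetime import datetime

# Hypothetical helpers. The two status labels are taken from the question;
# the date format (day/month/year) and the reference date are assumptions.
STATUSES = {"Not Frequent Flyer": 0.0, "Frequent Flyer": 1.0}
REFERENCE_DATE = datetime(2017, 1, 1)  # arbitrary origin, for illustration only


def process_status(status):
    """Encode the binary status as a one-element list."""
    return [STATUSES[status]]


def process_date(date, fmt="%d/%m/%Y"):
    """Encode a date as the number of days since REFERENCE_DATE."""
    delta = datetime.strptime(date, fmt) - REFERENCE_DATE
    return [float(delta.days)]


print(process_status("Frequent Flyer"))  # [1.0]
print(process_date("1/2/2018"))          # [396.0] (1 February 2018)
```

Both helpers return one-element lists rather than bare floats so that they can be concatenated with the one-hot vectors along axis 0.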
Or you can process it on the fly with TensorFlow, whose ops run in C++. What you can do is a table lookup, which is much like looking into a dictionary:
dictionary = tf.contrib.lookup.index_table_from_file(dictionary, num_oov_buckets=0)
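To make the dictionary analogy concrete, here is a plain-Python sketch of what such a lookup does conceptually. It is not the TensorFlow implementation; build_index_table and the vocabulary are hypothetical names for illustration. It assumes one vocabulary entry per line and, as I understand the behaviour with num_oov_buckets=0, maps unknown keys to a default value of -1.

```python
def build_index_table(vocab_lines, default_value=-1):
    """Map each vocabulary entry to its line index; unknown keys get default_value."""
    table = {token.strip(): i for i, token in enumerate(vocab_lines)}

    def lookup(keys):
        return [table.get(key, default_value) for key in keys]

    return lookup


# Hypothetical vocabulary, one city per line as it would appear in the file.
lookup = build_index_table(["Moscow", "London", "Paris"])
print(lookup(["London", "Rome", "Moscow"]))  # [1, -1, 0]
```

The resulting integer indices are what you would then feed to a one-hot or embedding layer.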
You have to understand that, given the amount of context and code you have provided, it is hard to help you further than Matěj Račinský and I have tried to do. We have no clue what your downstream task is, for instance.
If you want to do something NLP-related, maybe you can have a look at the blog post I wrote a few months back, Multi-label Text Classification with Tensorflow, which includes the data pre-processing with table_lookup.
Upvotes: 0
Reputation: 1804
You need to do some heavier preprocessing for that data. Neural networks can't work with text data directly, so you need to do some embedding.
Depending on the type of each feature, you will probably want to encode the data as numbers using one-hot or label encoding, or transform it to geographical coordinates, if that makes sense for the task.
You'll probably use one-hot encoding for city names, since they are categorical data, but you'll want to transform ordinal data, like dates, to numbers. Also think about which data are useful for the task, e.g. whether the problem you want to solve with the NN uses the person ID or not.
Also, you will probably have tensors of different shapes after the input processing, so it might be better to split the input into multiple variables (e.g. if you had some features encoded as one-hot, and some not).
Also remember that you'll need to normalize inputs to the network, so choose the representation accordingly.
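As one illustration of normalizing a numeric input, here is a minimal sketch of z-score standardization in plain Python. The function name standardize and the sample values are hypothetical, and z-scoring is only one common choice; min-max scaling would be another.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical "days since a reference date" values for three trips.
days = [10.0, 20.0, 30.0]
scaled = standardize(days)
print(scaled)
```

In practice the mean and standard deviation would be computed on the training set only and reused unchanged for validation and test data.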
I'm afraid there is no plug'n'play solution for that.
Upvotes: 2