Reputation: 7099

Split tensor into training and test sets

Let's say I've read in a text file using a TextLineReader. Is there some way to split this into train and test sets in TensorFlow? Something like:

def read_my_file_format(filename_queue):
  reader = tf.TextLineReader()
  key, record_string = reader.read(filename_queue)
  raw_features, labels = tf.decode_csv(record_string)
  features = some_processing(raw_features)
  # tf.train_split does not exist; it stands for the function being asked about.
  features_train, labels_train, features_test, labels_test = tf.train_split(features,
                                                                            labels,
                                                                            frac=.1)
  return features_train, labels_train, features_test, labels_test

Upvotes: 25

Answers (5)

user1454804

Reputation: 1080

Something like the following should work: tf.split_v(tf.random_shuffle(...

Edit: for TensorFlow > 0.12, this should now be called as tf.split(tf.random.shuffle(...

Reference

See docs for tf.split and for tf.random.shuffle for examples.
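A minimal sketch of that idea, assuming TensorFlow 2.x with eager execution. The helper name shuffle_and_split and its test_frac parameter are made up for illustration and are not part of TensorFlow. It shuffles a single index vector and gathers both tensors with it, so that features and labels stay aligned instead of being shuffled independently:

```python
import tensorflow as tf


def shuffle_and_split(features, labels, test_frac=0.1, seed=None):
    """Hypothetical helper: shuffle rows, then split into train and test parts."""
    n = int(features.shape[0])
    n_test = int(round(n * test_frac))
    # Shuffle one index vector so features and labels get the same permutation.
    idx = tf.random.shuffle(tf.range(n), seed=seed)
    features = tf.gather(features, idx)
    labels = tf.gather(labels, idx)
    # tf.split accepts a list of sizes along the given axis.
    features_train, features_test = tf.split(features, [n - n_test, n_test], axis=0)
    labels_train, labels_test = tf.split(labels, [n - n_test, n_test], axis=0)
    return features_train, labels_train, features_test, labels_test


# Toy data, made up for illustration: row i of features is [2*i, 2*i + 1].
features = tf.reshape(tf.range(20, dtype=tf.float32), [10, 2])
labels = tf.range(10)
f_train, l_train, f_test, l_test = shuffle_and_split(features, labels, test_frac=0.2)
print(f_train.shape, f_test.shape)
```

This only works when the whole tensor is in memory; with a queue-based reader that yields one record at a time, the split would have to happen per record instead.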

Upvotes: 16

phydev

Reputation: 292

I've improvised a solution by wrapping the train_test_split function from sklearn so that it accepts tensors as input and returns tensors as well.

I'm new to TensorFlow and facing the same issue, so if you have a better solution that doesn't need a different package, I'd appreciate it.

import tensorflow as tf
from sklearn.model_selection import train_test_split


def train_test_split_tensors(X, y, **options):
    """
    Wrapper around sklearn.model_selection.train_test_split that
    splits tensors and returns tensors.

    :param X: tensorflow.Tensor object
    :param y: tensorflow.Tensor object
    :param options: keyword arguments passed through to sklearn, such as test_size and train_size
    """
    # .numpy() requires eager tensors (the default in TensorFlow 2.x).
    X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), **options)

    # Convert the NumPy arrays back to tensors.
    X_train, X_test = tf.constant(X_train), tf.constant(X_test)
    y_train, y_test = tf.constant(y_train), tf.constant(y_test)

    return X_train, X_test, y_train, y_test

Upvotes: 4

Jspies

Reputation: 379

As elham mentioned, you can use scikit-learn to do this easily. scikit-learn is an open-source library for machine learning. It has many tools for data preparation, including the model_selection module, which handles comparing, validating and choosing parameters.

The model_selection.train_test_split() function is designed to split your data into train and test sets randomly, by a given fraction.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)

test_size is the fraction of the data to reserve for testing (0.33 here, i.e. 33%), and random_state seeds the random sampling.

I typically use this to produce train and validation data sets, and keep the true test data separate. You could also run train_test_split twice to get all three: first split the data into (Train + Validation) and Test, then split Train + Validation into two separate tensors.
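The two-stage split could look roughly like this. It is a sketch on made-up NumPy data; the 60/20/20 proportions and the variable names are assumptions chosen for illustration, not something from the question:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 100 samples with 3 features each.
features = np.arange(300).reshape(100, 3)
labels = np.arange(100)

# First split: hold out 20% of all samples as the true test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% becomes validation (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

Note that the second test_size is relative to the Train + Validation portion, not to the full data set.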

Upvotes: 20

Igor Gadelha Pereira

Reputation: 51

I got good results using the map and filter functions of the tf.data.Dataset API. Use map to randomly assign each example to train or test: for each example, draw a sample from a uniform distribution and check whether it falls below the training rate. Then use filter to select the examples for each set.

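import tensorflow as tf

def split_train_test(parsed_features, train_rate):
    # Draw an integer in [0, 100); the example is a training example if it falls below train_rate * 100.
    parsed_features['is_train'] = tf.gather(tf.random.uniform([1], maxval=100, dtype=tf.int32) < tf.cast(train_rate * 100, tf.int32), 0)
    return parsed_features

def grab_train_examples(parsed_features):
    # Filter predicate: keep training examples.
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    # Filter predicate: keep test examples.
    return ~parsed_features['is_train']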
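One thing worth checking with this approach: a random draw inside map may be redone on every pass over the dataset, so train and test pipelines that are iterated separately are not guaranteed to be complementary. The sketch below swaps in a different technique to avoid that, assuming TensorFlow 2.x: it keys tf.random.stateless_uniform on the element index from enumerate(), so the same element always gets the same assignment. The names TRAIN_RATE and is_train and the toy dataset are made up for illustration:

```python
import tensorflow as tf

TRAIN_RATE = 0.8  # hypothetical fraction of examples assigned to training


def is_train(index, seed=42):
    """Decide deterministically, from the element index, whether an example is training data."""
    # The stateless op returns the same value for the same seed pair, on every pass.
    seed_pair = tf.stack([tf.cast(index, tf.int64), tf.constant(seed, tf.int64)])
    draw = tf.random.stateless_uniform([], seed=seed_pair)
    return draw < TRAIN_RATE


# Toy dataset, made up for illustration; enumerate() attaches an index to each element.
dataset = tf.data.Dataset.range(100, 200).enumerate()

# Filter on the index, then drop it again with map.
train_ds = dataset.filter(lambda i, x: is_train(i)).map(lambda i, x: x)
test_ds = dataset.filter(lambda i, x: tf.logical_not(is_train(i))).map(lambda i, x: x)

train = [int(x) for x in train_ds.as_numpy_iterator()]
test = [int(x) for x in test_ds.as_numpy_iterator()]
print(len(train), len(test))
```

The split sizes are only approximately TRAIN_RATE, since each element is assigned independently.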

Upvotes: 4

elham shawky

Reputation: 81

import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(
    features, labels, test_size=0.33, random_state=42)

Upvotes: 7
