Zhao Chen

Reputation: 143

How to load pickle files with tensorflow's tf.data API

I have my data in multiple pickle files stored on disk. I want to use tensorflow's tf.data.Dataset to load my data into a training pipeline. My code goes:

def _parse_file(path):
    image, label = *load pickle file*
    return image, label
paths = glob.glob('*.pkl')
print(len(paths))
dataset = tf.data.Dataset.from_tensor_slices(paths)
dataset = dataset.map(_parse_file)
iterator = dataset.make_one_shot_iterator()

The problem is that I don't know how to implement the _parse_file function. The argument to this function, path, is a tensor, not a Python string. I tried

def _parse_file(path):
    with tf.Session() as s:
        p = s.run(path)
        image, label = pickle.load(open(p, 'rb'))
    return image, label

and got this error message:

InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'arg0' with dtype string
     [[Node: arg0 = Placeholder[dtype=DT_STRING, shape=<unknown>, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

After searching online I still have no idea how to do it. I would be grateful to anyone who can give me a hint.

Upvotes: 6

Views: 9054

Answers (3)

RubinMac

Reputation: 174

This is how I solved the issue without using tf.py_func. See the function load_encoding() below, which does the pickle reading. FACELIB_DIR contains directories of pickled vggface2 encodings, one directory per person, each named after the person whose face encodings it holds.

import tensorflow as tf
import pickle
import os

FACELIB_DIR='/var/noggin/FaceEncodings'

# Get list of all classes & build a quick int-lookup dictionary
labelNames = sorted([x for x in os.listdir(FACELIB_DIR) if os.path.isdir(os.path.join(FACELIB_DIR,x)) and not x.startswith('.')])
labelStrToInt = dict([(x,i) for i,x in enumerate(labelNames)])

# Function load_encoding - Loads encoding data from the enc2048 file at file_path
#    Reads an encoding from disk and derives the integer label from the parent
#    directory name in the path; returns both.
#    file_path must be a plain Python string relative to FACELIB_DIR, not a tensor.
def load_encoding(file_path):
    with open(os.path.join(FACELIB_DIR,file_path),'rb') as fin:
        A,_ = pickle.load(fin)    # encodings, source_image_name
    # Plain string split: a tf.strings.split result is a tensor and cannot be looked up in labelStrToInt
    label_str = file_path.split(os.path.sep)[-2]
    return (A, labelStrToInt[label_str])

# Build the dataset of every enc2048 file in our data library
encpaths = []
for D in labelNames:    # same sorted directory list that was built above
    # All the encoding files
    encfiles = sorted(filter((lambda x: x.endswith('.enc2048')), os.listdir(os.path.join(FACELIB_DIR, D))))
    encpaths += [os.path.join(D,x) for x in encfiles]
dataset = tf.data.Dataset.from_tensor_slices(encpaths)

# Shuffle and speed improvements on the dataset
BATCH_SIZE = 64
AUTOTUNE = tf.data.AUTOTUNE
dataset = (dataset
    .shuffle(1024)
    .cache()
    .repeat()
    .batch(BATCH_SIZE)
    .prefetch(AUTOTUNE)
)
    
# Benchmark our tf.data pipeline
import time
datasetGen = iter(dataset)
NUM_STEPS = 10000
start_time = time.time()
for i in range(0, NUM_STEPS):
    X = next(datasetGen)
totalTime = time.time() - start_time
print('==> tf.data generated {} tensors in {:.2f} seconds'.format(BATCH_SIZE * NUM_STEPS, totalTime))
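
Note that in the snippet above load_encoding() is never attached to the dataset, so the pipeline only yields path strings. One possible way to feed a pure-Python pickle reader into tf.data without calling tf.py_func yourself is tf.data.Dataset.from_generator. The following is only a minimal sketch under assumptions: it needs TensorFlow 2.4 or later for output_signature, the temporary directory stands in for FACELIB_DIR, the person names "alice" and "bob" are made up, and the encoding length of 8 is a placeholder for the real size.

```python
import os
import pickle
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical stand-in for FACELIB_DIR: one directory per person, each file
# holding a pickled (encoding, source_image_name) tuple.
root = tempfile.mkdtemp()
for person in ("alice", "bob"):
    os.makedirs(os.path.join(root, person))
    for k in range(2):
        payload = (np.ones(8, dtype=np.float32) * k, "img_{}.jpg".format(k))
        with open(os.path.join(root, person, "{}.enc2048".format(k)), "wb") as f:
            pickle.dump(payload, f)

label_names = sorted(os.listdir(root))
label_to_int = {name: i for i, name in enumerate(label_names)}


def encoding_generator():
    # Plain Python: paths are ordinary strings here, so open() and dict lookups work.
    for name in label_names:
        person_dir = os.path.join(root, name)
        for fname in sorted(os.listdir(person_dir)):
            with open(os.path.join(person_dir, fname), "rb") as f:
                encoding, _ = pickle.load(f)
            yield np.asarray(encoding, dtype=np.float32), label_to_int[name]


dataset = tf.data.Dataset.from_generator(
    encoding_generator,
    output_signature=(
        tf.TensorSpec(shape=(8,), dtype=tf.float32),  # assumed encoding length
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
).batch(2)

# Materialize the batches as NumPy arrays to inspect them.
batches = [(x.numpy(), y.numpy()) for x, y in dataset]
```

The usual shuffle, cache, repeat, and prefetch calls can be chained after from_generator in the same way as in the pipeline above. The trade-off is that the generator runs sequentially in Python, so it may be slower than a parallel map over file paths.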

Upvotes: 1

DEEPAK KUMAR

Reputation: 1

Use tf.py_func. This function solves the problem, as mentioned in the docs.

Upvotes: -1

Zhao Chen

Reputation: 143

I have solved this myself. The Python loading code should be wrapped with tf.py_func, as in this doc.
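
For readers on TensorFlow 2, where tf.py_func is only available as tf.compat.v1.py_func, a rough equivalent uses tf.numpy_function. This is a minimal sketch rather than the exact code from the question: the temporary directory, the file names, and the 4x4 image shape are assumptions made so the example is self-contained.

```python
import glob
import os
import pickle
import tempfile

import numpy as np
import tensorflow as tf

# Hypothetical sample data: three pickle files, each holding an (image, label) tuple.
data_dir = tempfile.mkdtemp()
for i in range(3):
    sample = (np.full((4, 4), i, dtype=np.float32), i)
    with open(os.path.join(data_dir, "sample_{}.pkl".format(i)), "wb") as f:
        pickle.dump(sample, f)


def _load_pickle(path):
    # Runs as ordinary Python; `path` arrives as a bytes object, not a tensor.
    with open(path.decode("utf-8"), "rb") as f:
        image, label = pickle.load(f)
    return np.asarray(image, dtype=np.float32), np.asarray(label, dtype=np.int64)


def _parse_file(path):
    # Wrap the Python loader so it can be called from inside the tf.data graph.
    image, label = tf.numpy_function(_load_pickle, [path], [tf.float32, tf.int64])
    # Static shapes are unknown after the wrapper; restore them (4x4 is assumed here).
    image.set_shape([4, 4])
    label.set_shape([])
    return image, label


paths = sorted(glob.glob(os.path.join(data_dir, "*.pkl")))
dataset = tf.data.Dataset.from_tensor_slices(paths).map(_parse_file)

# Eager iteration in TensorFlow 2 replaces make_one_shot_iterator().
samples = [(image.numpy(), int(label.numpy())) for image, label in dataset]
```

The wrapped function executes in the Python interpreter, so it is not serialized with the graph and is subject to the global interpreter lock. The declared output dtypes must match what the loader actually returns.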

Upvotes: 3
