Reputation: 51
I am wondering whether there is a reasonable way to generate client datasets for a federated learning simulation using the TFF core code. The Federated Core tutorial uses the MNIST database, with each client holding only one distinct label in its dataset. In that setup, there are only 10 different labels available. If I want to have more clients, how can I do that? Thanks in advance.
Upvotes: 2
Views: 1487
Reputation: 880
If you want to create a dataset from scratch, you can use tff.simulation.FromTensorSlicesClientData to convert tensors into a TFF ClientData object. You just need to pass a dictionary with the client ID as the key and that client's data as the value.
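import collections
import tensorflow_federated as tff

# Assumes x_train, y_train, split (number of clients) and
# image_per_set (examples per client) are already defined.
client_train_dataset = collections.OrderedDict()
for i in range(1, split + 1):
    client_name = "client_" + str(i)
    # Each client gets a contiguous slice of the training arrays.
    start = image_per_set * (i - 1)
    end = image_per_set * i
    print(f"Adding data from {start} to {end} for client : {client_name}")
    data = collections.OrderedDict((('label', y_train[start:end]), ('pixels', x_train[start:end])))
    client_train_dataset[client_name] = data
train_dataset = tff.simulation.FromTensorSlicesClientData(client_train_dataset)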
You can check my complete implementation here, where I have split MNIST into 4 clients.
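To get an arbitrary number of clients, the slicing loop above can be wrapped in a small helper. What follows is only a minimal sketch: the function name `split_into_clients` and the toy lists are hypothetical, and it uses plain Python sequences so it runs without TFF installed. With NumPy arrays the same slicing should work, and the resulting dictionary has the same shape as `client_train_dataset` above.

```python
import collections


def split_into_clients(x, y, num_clients):
    """Split parallel sequences x and y into num_clients contiguous shards.

    Hypothetical helper: returns an OrderedDict mapping a client name to an
    OrderedDict with 'label' and 'pixels' entries.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 1 <= num_clients <= len(x):
        raise ValueError("num_clients must be between 1 and len(x)")

    per_client, remainder = divmod(len(x), num_clients)
    clients = collections.OrderedDict()
    start = 0
    for i in range(num_clients):
        # The first `remainder` clients receive one extra example each.
        end = start + per_client + (1 if i < remainder else 0)
        clients[f"client_{i + 1}"] = collections.OrderedDict(
            (("label", y[start:end]), ("pixels", x[start:end]))
        )
        start = end
    return clients


# Toy stand-ins for x_train / y_train (10 examples, 4 clients).
x_toy = [[float(i)] * 4 for i in range(10)]
y_toy = [i % 3 for i in range(10)]
toy_clients = split_into_clients(x_toy, y_toy, num_clients=4)
```

Because the split is contiguous, the label mix on each client depends on how the data is ordered; shuffling the arrays first would give a roughly IID split.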
Upvotes: 1
Reputation: 1405
There are preprocessed simulation datasets in TFF that should serve this purpose quite nicely. Take, for example, loading EMNIST where the images are partitioned by writer, corresponding to user, as opposed to label. This can be loaded into a Python runtime rather simply (here creating train data with 100 clients):
import tensorflow as tf
import tensorflow_federated as tff

source, _ = tff.simulation.datasets.emnist.load_data()

def map_fn(example):
    # Flatten each image and rename the fields to x / y.
    return {'x': tf.reshape(example['pixels'], [-1]), 'y': example['label']}

def client_data(n):
    ds = source.create_tf_dataset_for_client(source.client_ids[n])
    return ds.repeat(10).map(map_fn).shuffle(500).batch(20)

train_data = [client_data(n) for n in range(100)]
There are existing datasets partitioned in a similar way for extended MNIST (i.e., it includes handwritten characters in addition to digits), Shakespeare plays (partitioned by character), and Stack Overflow posts (partitioned by user). Documentation on these datasets can be found here.
If you wish to create your own custom dataset, please see the answer here.
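If the goal is to keep the one-label-per-client setup from the tutorial while having more than 10 clients, one option is to spread each label's examples over several clients. This is only a rough sketch under that assumption: `partition_by_label` and the client naming scheme are hypothetical, it is plain Python rather than a TFF API, and it returns example indices that you would then use to slice your own arrays.

```python
import collections


def partition_by_label(labels, clients_per_label):
    """Assign example indices to clients so each client sees a single label.

    Hypothetical helper: returns an OrderedDict mapping a client name to the
    list of example indices that belong to that client.
    """
    # Group example indices by their label.
    by_label = collections.defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    clients = collections.OrderedDict()
    for label in sorted(by_label):
        indices = by_label[label]
        for k in range(clients_per_label):
            # Strided slicing spreads this label's examples across its clients.
            shard = indices[k::clients_per_label]
            if shard:  # Skip clients that would end up with no data.
                clients[f"label_{label}_client_{k}"] = shard
    return clients


# Toy labels: 3 classes, 4 examples each, 2 clients per class -> 6 clients.
toy_labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
label_clients = partition_by_label(toy_labels, clients_per_label=2)
```

With MNIST's 10 labels, `clients_per_label=10` would give up to 100 single-label clients.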
Upvotes: 0