Reputation: 296
I am testing some algorithms in TensorFlow Federated (TFF). In this regard, I would like to test and compare them on the same federated dataset with different "levels" of data heterogeneity, i.e. non-IIDness.
Hence, I would like to know whether there is any way to control and tune the "level" of non-IIDness in a specific federated dataset, in an automatic or semi-automatic fashion, e.g. by means of TFF APIs or just traditional TF API (maybe inside the Dataset utils).
To be more practical: for instance, the EMNIST federated dataset provided by TFF has 3383 clients with each one of them having their handwritten characters. However, these local dataset seems to be quite balanced in terms of number of local examples and in terms of represented classes (all classes are, more or less, represented locally). If I would like to have a federated dataset (e.g., starting by the TFF's EMNIST one) that is:
tff.simulation.datasets.build_single_label_dataset
documentation here. If so, how should I use it from a federated dataset such as the ones already provided by TFF?;how should I proceed inside the TFF framework to prepare a federated dataset with those characteristics?
Should I do all the stuff by hand? Or do some of you have some advices to automate this process?
An additional question: in this paper "Measuring the Effects of Non-Identical Data Distribution for Federated Visual Classification", by Hsu et al., they exploit the Dirichlet distribution to synthesize a population of non-identical clients, and they use a concentration parameter to control the identicalness among clients. This seems an wasy-to-tune way to produce datasets with different levels of heterogeneity. Any advice about how to implement this strategy (or a similar one) inside the TFF framework, or just in TensorFlow (Python) considering a simple dataset such as the EMNIST, would be very useful too.
Thank you a lot.
Upvotes: 4
Views: 701
Reputation: 2941
For Federated Learning simulations, its quite reasonable to setup the client datasets in Python, in the experiment driver, to achieve the desired distributions. At some high-level, TFF handles modeling data location ("placements" in the type system) and computation logic. Re-mixing/generating a simulation dataset is not quite core to the library, though there are helpful libraries as you've found. Doing this directly in python by manipulating the tf.data.Dataset
and then "pushing" the client datasets into a TFF computation seems straightforward.
Yes, tff.simulation.datasets.build_single_label_dataset
is intended for this purpose.
It takes a tf.data.Dataset
and essentially filters out all examples that don't match desired_label
values for the label_key
(assuming the dataset yields dict
like structures).
For EMNIST, to create a dataset of all the ones (regardless of user), this could be achieved by:
train_data, _ = tff.simulation.datasets.emnist.load_data()
ones = tff.simulation.datasets.build_single_label_dataset(
train_data.create_tf_dataset_from_all_clients(),
label_key='label', desired_label=1)
print(ones.element_spec)
>>> OrderedDict([('label', TensorSpec(shape=(), dtype=tf.int32, name=None)), ('pixels', TensorSpec(shape=(28, 28), dtype=tf.float32, name=None))])
print(next(iter(ones))['label'])
>>> tf.Tensor(1, shape=(), dtype=int32)
Using a combination of tf.data.Dataset.repeat
and tf.data.Dataset.take
can be used to create data imbalances.
train_data, _ = tff.simulation.datasets.emnist.load_data()
datasets = [train_data.create_tf_dataset_for_client(id) for id in train_data.client_ids[:2]]
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [93, 109]
datasets[0] = datasets[0].repeat(5)
datasets[1] = datasets[1].take(5)
print([tf.data.experimental.cardinality(ds).numpy() for ds in datasets])
>>> [465, 5]
Upvotes: 5