Eliza

Reputation: 584

TFF: How to split each client's data

Why, in a federated learning task, don't we split the dataset into train, validation, and test sets? We only make train and test splits.

Upvotes: 3

Views: 558

Answers (1)

Zachary Garrett

Reputation: 2941

The choice of how to split the datasets is really up to the evaluator and what they are trying to accomplish. The preprocessed datasets in TFF (from tff.simulation.datasets) are usually split into only two parts (train and test), but they can be rejoined and split again in whatever way is desired.

One thing to consider: there are (at least) two dimensions that may be interesting to split on for federated learning.

  1. examples: Splitting a single client's dataset into train, test, and validation. This could possibly be seen as most analogous to the centralized training regime. Most TFF datasets use this.
  2. users: Splitting users into train, test, and held-out users might be particularly interesting in the federated regime. This might be able to answer how well a global model generalizes to unseen users, but might be heavily affected by the non-IID-ness of the individual datasets and splits. This is used in a few TFF-provided datasets.
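
To make the difference between the two dimensions concrete, here is a minimal sketch in plain Python. It does not use the TFF API: the dict-of-lists `toy` data, the helper names `split_examples` and `split_users`, and the 80/10/10 fractions are all assumptions chosen for illustration.

```python
import random

def split_examples(client_data, fractions=(0.8, 0.1, 0.1), seed=0):
    """Example-level split: every client appears in train, validation and test."""
    rng = random.Random(seed)
    splits = {"train": {}, "validation": {}, "test": {}}
    for client_id, examples in client_data.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_train = int(len(shuffled) * fractions[0])
        n_val = int(len(shuffled) * fractions[1])
        splits["train"][client_id] = shuffled[:n_train]
        splits["validation"][client_id] = shuffled[n_train:n_train + n_val]
        splits["test"][client_id] = shuffled[n_train + n_val:]
    return splits

def split_users(client_data, fractions=(0.8, 0.1, 0.1), seed=0):
    """User-level split: each client lands in exactly one of the three groups."""
    rng = random.Random(seed)
    client_ids = sorted(client_data)
    rng.shuffle(client_ids)
    n_train = int(len(client_ids) * fractions[0])
    n_val = int(len(client_ids) * fractions[1])
    groups = {
        "train": client_ids[:n_train],
        "validation": client_ids[n_train:n_train + n_val],
        "test": client_ids[n_train + n_val:],
    }
    return {name: {cid: client_data[cid] for cid in ids}
            for name, ids in groups.items()}

# Hypothetical toy data: 10 clients with 10 examples each.
toy = {f"client_{i}": list(range(i * 10, i * 10 + 10)) for i in range(10)}
by_example = split_examples(toy)
by_user = split_users(toy)
```

With the example-level split, the test set measures performance on new data from users the model has already trained on. With the user-level split, it measures performance on users the model has never seen.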

Furthermore, both of these could be time-based (if there is a notion of time), for example splitting each client's dataset into "previous day" (train) and "next day" (test). Or, as is often the case in practice with cross-device FL, splitting by time of day (the users available for training at night may be different from those available mid-day). Eichner 2019 performed some experiments using this setup.
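
A time-based split of each client's examples could look roughly like the sketch below. It assumes each example carries a timestamp, represented here as a hypothetical `(day_index, example)` pair. The helper `split_by_time` and the toy data are illustrative only and not part of TFF.

```python
def split_by_time(client_data, cutoff):
    """Per client, examples before `cutoff` go to train, the rest go to test."""
    train, test = {}, {}
    for client_id, examples in client_data.items():
        train[client_id] = [x for t, x in examples if t < cutoff]
        test[client_id] = [x for t, x in examples if t >= cutoff]
    return train, test

# Hypothetical toy data: day 0 is the "previous day", day 1 is the "next day".
toy_timed = {
    "client_a": [(0, "a0"), (0, "a1"), (1, "a2")],
    "client_b": [(0, "b0"), (1, "b1"), (1, "b2")],
}
train, test = split_by_time(toy_timed, cutoff=1)
```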

Note: tff.simulation.datasets.stackoverflow.load_data does have three splits, named train, held_out and test. Please read the documentation carefully, as it uses both types of splitting mentioned above.

Upvotes: 3
