Reputation: 584
I have 2 classes with 140 examples each, and 4 clients. I would like to create a non-iid dataset like the one in the McMahan paper. How should I divide the examples into fragments?
Upvotes: 1
Views: 1571
Reputation: 2941
Note: there are many notions of "non-iid-ness" that may be interesting to explore.
Label non-iid: you might want to make the distribution of labels very unbalanced across clients. Even when every client gets the same number of examples, the label distribution can still be non-iid, such as [(35, 35), (10, 60), (50, 20), (45, 25)], where each pair is one client's count of examples per class. The McMahan 2016 paper takes a similar approach, but uses a problem with 10 classes and gives most clients only two classes (the exact method is on Page 5 of the paper).
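As a rough sketch of the label-skew idea for the setup in the question (2 classes of 140 examples, 4 clients), the NumPy snippet below shows two options. The function names `split_by_counts` and `shard_partition` are made up for this illustration and are not part of TensorFlow Federated or any other library. The first hands out an explicit per-client count table; the second is only my reading of the sort-and-shard approach (sort by label, cut into equal shards, give each client a few shards), so check it against the paper before relying on it.

```python
import numpy as np

def split_by_counts(labels, counts, seed=0):
    """counts[c][k] = how many examples of class k client c receives (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    counts = np.asarray(counts)
    per_client = [[] for _ in range(len(counts))]
    for k in range(counts.shape[1]):
        # Shuffle the indices of class k, then cut them at the cumulative counts.
        pool = rng.permutation(np.flatnonzero(labels == k))
        needed = counts[:, k].sum()
        if needed > len(pool):
            raise ValueError(f"class {k}: asked for {needed}, only {len(pool)} available")
        parts = np.split(pool[:needed], np.cumsum(counts[:, k])[:-1])
        for c, part in enumerate(parts):
            per_client[c].append(part)
    return [np.concatenate(parts) for parts in per_client]

def shard_partition(labels, num_clients, shards_per_client, seed=0):
    """Sort indices by label, cut into equal shards, deal shards to clients (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sorted_idx = np.argsort(np.asarray(labels), kind="stable")
    num_shards = num_clients * shards_per_client
    shards = np.array_split(sorted_idx, num_shards)
    order = rng.permutation(num_shards)  # random shard-to-client assignment
    return [
        np.concatenate([shards[s] for s in order[c * shards_per_client:(c + 1) * shards_per_client]])
        for c in range(num_clients)
    ]

# Setup from the question: 2 classes x 140 examples, 4 clients.
labels = np.repeat([0, 1], 140)

by_counts = split_by_counts(labels, [(35, 35), (10, 60), (50, 20), (45, 25)])
by_shards = shard_partition(labels, num_clients=4, shards_per_client=2)

for name, clients in [("counts", by_counts), ("shards", by_shards)]:
    for c, idx in enumerate(clients):
        print(name, "client", c, np.bincount(labels[idx], minlength=2))
```

With only two classes, 8 shards of 35 examples give each client a class mix of (70, 0), (35, 35), or (0, 70), so the explicit count table offers finer control over the skew.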
Amount of data: you might want to give some clients more data than others. With 280 examples, perhaps the split is (180, 80, 10, 10) examples (this ignores how the labels are distributed). The StackOverflow dataset in TensorFlow Federated also exhibits this, as some clients have tens of thousands of examples, while others only have 100.
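A minimal sketch of an unequal-size split, assuming the examples are simply shuffled and cut at the requested sizes. `split_by_sizes` is a hypothetical helper name, and as noted above it pays no attention to labels.

```python
import numpy as np

def split_by_sizes(num_examples, sizes, seed=0):
    """Shuffle example indices and cut them into chunks of the given sizes (hypothetical helper)."""
    sizes = np.asarray(sizes)
    if sizes.sum() != num_examples:
        raise ValueError("sizes must add up to the number of examples")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_examples)
    # np.split cuts at the cumulative boundaries, giving one chunk per client.
    return np.split(perm, np.cumsum(sizes)[:-1])

clients = split_by_sizes(280, [180, 80, 10, 10])
print([len(c) for c in clients])  # [180, 80, 10, 10]
```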
Feature non-iid: if there are patterns in the feature space, it may be useful to restrict certain patterns to certain users. For example, in an image recognition task, perhaps some cameras had a different white balance, rotation, or color saturation than others (even though they have most or all labels). Instead of shuffling these examples randomly across synthetic clients, grouping similar feature patterns into a single client gives a different form of non-iid.
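One way to sketch the feature-based grouping is to compute a single scalar per example and give each client a contiguous range of it. In the snippet below, mean brightness stands in for something like a camera setting, the images are random synthetic data, and `split_by_feature` is a hypothetical helper name, so all of it is an assumption for illustration only.

```python
import numpy as np

def split_by_feature(feature_values, num_clients):
    """Sort examples by one scalar feature and give each client a contiguous slice (hypothetical helper)."""
    order = np.argsort(np.asarray(feature_values), kind="stable")
    return np.array_split(order, num_clients)

# Synthetic stand-in for real images: 280 grayscale 8x8 arrays.
rng = np.random.default_rng(0)
images = rng.random((280, 8, 8))
brightness = images.mean(axis=(1, 2))  # one scalar per image

clients = split_by_feature(brightness, num_clients=4)
for c, idx in enumerate(clients):
    print("client", c, len(idx), "examples, mean brightness", brightness[idx].mean())
```

Each client then sees only one band of brightness values while the labels are left untouched, which is the kind of skew described above.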
Upvotes: 3