Reputation: 13289
I have a set of elements U (initially unknown size) and I would like to generate a random sample of n << |U| elements. Stream sampling works fine for this.
The issue comes when I have subdivided U into several subsets and taken a random sample of each subset (each sample contains k <= n elements, but usually k = n). I also know how many elements are in each subset. I would like to know how to combine these samples (preferably merging two samples at a time) into one size n sample.
Or put another way, given disjoint sets A and B, and random samples a ⊆ A and b ⊆ B, I would like to make c ⊆ a ∪ b such that c is a random sample of A ∪ B and I can specify the size of c (usually |c| will be about the same as |a|).
Upvotes: 5
Views: 683
Reputation: 1419
Act as if you are still sampling from U. To choose each element of the final sample, first choose the subset S_i it should come from, with probability proportional to the size of S_i. So if S_1 is 20% of U, you draw from S_1 with 20% probability. Once you've chosen the subset, take any one of the not-yet-used elements of the sample you have for that subset and add it to the final sample. This could run into problems if the k values are less than n, but if usually k = n, it probably won't be a problem for you.
Putting this in terms of your A and B formulation, build up c as follows: with probability |A|/|A ∪ B| take your next element from a; with probability |B|/|A ∪ B| = 1 - |A|/|A ∪ B| take it from b. (As I noted above, this could run into problems if |a| is not somewhat larger than n * |A|/|A ∪ B|, and likewise if |b| is not somewhat larger than n * |B|/|A ∪ B|; but if that's the case, it's not clear to me that you can do what you want to do.) This lets you build up your sample two subsets at a time.
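A minimal Python sketch of this merge step, assuming A and B are disjoint and that a and b are uniform samples without replacement. The function name `merge_samples` and its parameters are made up for illustration. It departs from the fixed-probability description in one respect: it decrements the remaining population counts after each draw, so the choice of source behaves like drawing without replacement from A ∪ B. When n << |A| + |B| the difference should be small.

```python
import random

def merge_samples(a, size_a, b, size_b, n, rng=random):
    """Combine sample a (drawn from A) and sample b (drawn from B) into n elements.

    size_a and size_b are the population sizes |A| and |B|, not the sample sizes.
    """
    if n > size_a + size_b:
        raise ValueError("n exceeds |A| + |B|")
    # Copy and shuffle so that popping from the end yields a random unused element.
    a, b = list(a), list(b)
    rng.shuffle(a)
    rng.shuffle(b)
    remaining_a, remaining_b = size_a, size_b
    merged = []
    for _ in range(n):
        # Choose the source population in proportion to its not-yet-drawn elements.
        if rng.random() * (remaining_a + remaining_b) < remaining_a:
            source = a
            remaining_a -= 1
        else:
            source = b
            remaining_b -= 1
        if not source:
            # The caveat above: the chosen side's sample was too small for this n.
            raise ValueError("ran out of sampled elements on one side")
        merged.append(source.pop())
    return merged
```

For example, `merge_samples(a, 1000, b, 3000, 10)` would return 10 elements, on average about a quarter of them from a. Because the result comes with a known population size `size_a + size_b`, it can be fed back in as one of the inputs to merge a third sample.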
Upvotes: 3
Reputation: 7817
If |A| == |B| and |a| == |b|, then you should not worry at all. Just do regular random sampling from a ∪ b.
Upvotes: 0