U2EF1

Reputation: 13289

Combining Random Samples

I have a set of elements U (initially unknown size) and I would like to generate a random sample of n << |U| elements. Stream sampling works fine for this.

The issue comes when I have subdivided U into several subsets and taken a random sample of each subset (each sample contains k <= n elements, but usually k = n). I also know how many elements are in each subset. I would like to know how to combine these samples (preferably merging two samples at a time) into one size n sample.

Or put another way, given distinct sets A and B, and random samples a ⊆ A and b ⊆ B, I would like to make c ⊆ a ∪ b, such that c is a random sample of A ∪ B and I may specify the size of c (usually |c| will be about the same size as |a|).

Upvotes: 5

Views: 683

Answers (2)

David

Reputation: 1419

Act as if you are still sampling from U. To choose each element of the final sample, first choose the subset S_i it should come from, in proportion to the relative sizes of the S_i. So if S_1 is 20% of U, you take the element from S_1 with 20% probability. Once you've chosen the subset, take any one of the not-yet-used elements of the sample you have from that subset and add it to the final sample. This could run into problems if the k values are less than n, but if usually k = n, it probably won't be a problem for you.

Putting this in terms of your A and B formulation, build up c as follows: with probability |A|/|A ∪ B| take your next element from a; with probability |B|/|A ∪ B| = 1 - (|A|/|A ∪ B|) take your next element from b. (As I noted above, this could run into problems if |a| is not somewhat larger than n * (|A|/|A ∪ B|) (and the equivalent for |b|), but if that's the case, it's not clear to me that you can do what you want to do.) This lets you build up your sample two subsets at a time.
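
Here is a minimal Python sketch of the two-at-a-time merge described above. It assumes a and b are uniform samples drawn without replacement, that A and B do not overlap, and that |A| and |B| are known. The function name `merge_samples` and its parameters are made up for illustration. One deliberate variation: the counts are decremented after each draw, so the chance of picking from a is (elements left in A) / (elements left in A ∪ B) rather than a fixed |A|/|A ∪ B|. That makes the number of elements taken from each side hypergeometric, which is what drawing n elements without replacement from A ∪ B gives; with fixed probabilities the split is binomial, which is close when n is much smaller than |A ∪ B|.

```python
import random


def merge_samples(a, size_a, b, size_b, n, rng=random):
    """Merge sample a (from A, |A| = size_a) and sample b (from B, |B| = size_b)
    into a list of n elements. Hypothetical helper, not from any library."""
    if n > size_a + size_b:
        raise ValueError("n is larger than |A| + |B|")

    # Copy and shuffle so that popping from the end takes a random unused element.
    a = list(a)
    b = list(b)
    rng.shuffle(a)
    rng.shuffle(b)

    rem_a, rem_b = size_a, size_b  # elements of A and B not yet "drawn"
    merged = []
    for _ in range(n):
        # Pick side A with probability rem_a / (rem_a + rem_b), using integers only.
        if rng.randrange(rem_a + rem_b) < rem_a:
            if not a:
                raise ValueError("sample a is too small for this draw")
            merged.append(a.pop())
            rem_a -= 1
        else:
            if not b:
                raise ValueError("sample b is too small for this draw")
            merged.append(b.pop())
            rem_b -= 1
    return merged
```

To fold in a third subset C, the result can be treated as a sample of A ∪ B with size |A| + |B| and merged with the sample from C in the same way. The `ValueError` for an exhausted input corresponds to the caveat above about |a| or |b| being too small.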

Upvotes: 3

ElKamina

Reputation: 7817

If |A| == |B| and |a| == |b|, then you should not worry at all. Just do regular random sampling from a ∪ b.

Upvotes: 0
