Reputation: 1086
Is there any efficient way of getting the nth element of a set in Java? I know 2 ways of doing it:
- By iterating through it until I reach the required element
- By converting it to an ArrayList and getting the element from that ArrayList
The question is: is there any other way to get the nth element quickly? I would mainly need a feature like that for TreeSets.
EDIT: For example, if I want to select 1000 random elements from a TreeMap or TreeSet with 10,000,000 entries, very frequently (i.e. every 2-3 seconds), then copying it into an ArrayList every time is very inefficient, and iterating through so many elements is also inefficient.
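For reference, here is a rough sketch of the two approaches I mean (the method names nthByIteration and nthByCopy are just for illustration):
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class NthElement {

    // Walk the set with an iterator and stop at position n (O(n) time, no extra memory)
    static <T> T nthByIteration(Set<T> set, int n) {
        Iterator<T> iter = set.iterator();
        for (int i = 0; i < n; i++) {
            iter.next();
        }
        return iter.next();
    }

    // Copy the set into an ArrayList and use index-based access
    // (O(n) copy each time, then O(1) per lookup)
    static <T> T nthByCopy(Set<T> set, int n) {
        List<T> list = new ArrayList<>(set);
        return list.get(n);
    }
}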
Upvotes: 2
Views: 2184
Reputation: 7152
If you are sure that you need n elements from random positions in the set (a kind of statistical sampling), then you may want to consider iterating through the set once and picking up each element as a sample with the desired probability as you go. This is more efficient because you iterate through the set only once.
The following program demonstrates the idea:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

public class SamplingFromSet {

    public static void main(String[] args) {
        Set<String> population = new TreeSet<>();
        /*
         * Populate the set
         */
        final int popSize = 17;
        for (int i = 0; i < popSize; i++) {
            population.add(getRandomString());
        }
        List<String> sample
                = sampleFromPopulation(population, 3 /* sampleSize */);
        System.out.println("population is");
        System.out.println(population.toString());
        System.out.println("sample is");
        System.out.println(sample.toString());
    }

    /**
     * Pick some samples from a population
     * @param population
     * @param sampleSize - number of samples
     * @return
     */
    private static <T>
    List<T> sampleFromPopulation(Set<T> population
            , int sampleSize) {
        float sampleProb = ((float) sampleSize) / population.size();
        List<T> sample = new ArrayList<>();
        Iterator<T> iter = population.iterator();
        while (iter.hasNext()) {
            T element = iter.next();
            if (random.nextFloat() < sampleProb) {
                /*
                 * Lucky Draw!
                 */
                sample.add(element);
            }
        }
        return sample;
    }

    private static Random random = new Random();

    private static String getRandomString() {
        return String.valueOf(random.nextInt());
    }
}
Output of this program:
population is
[-1488564139, -1510380623, -1980218182, -354029751, -564386445, -57285541, -753388655, -775519772, 1538266464, 2006248253, 287039585, 386398836, 435619764, 48109172, 580324150, 64275438, 860615531]
sample is
[-57285541, -753388655, 386398836]
Update
The above program, however, has a caveat: since the samples are picked up by probability during that single walk through the set (each element is kept independently with probability sampleSize/popSize, so the resulting count is not fixed), the returned sample may, depending on your luck of the day, contain fewer or more elements than specified. This problem, however, can be remedied with a slight change of procedure, which uses a slightly different method signature:
/**
 * Pick some samples from a population
 * @param population
 * @param sampleSize - number of samples
 * @param exactSize - a boolean to control whether or not
 *        the returned sample list must be of the exact size
 *        as specified.
 * @return
 */
private static <T>
List<T> sampleFromPopulation(Set<T> population
        , int sampleSize
        , boolean exactSize);
Prevention of oversampling
During the single iteration through the population, we oversample a bit, and then at the end we drop some samples if we end up with too many.
Prevention of undersampling
Note also that, even with oversampling, there is a non-zero probability that, at the end of the one iteration through the population, we still get fewer samples than desired. If that happens (which is unlikely), we recursively call the same method again as a second try. (This recursion terminates with probability approaching one, because it is very unlikely that, across repeated recursive calls into the method, we consistently get undersampling.)
The following code implements the new sampleFromPopulation()
method:
// In addition to the imports above, this version also needs
// java.util.LinkedList, java.util.HashSet and java.util.Collections.
private static <T>
List<T> sampleFromPopulation(Set<T> population
        , int sampleSize
        , boolean exactSize) {
    int popSize = population.size();
    double sampleProb = ((double) sampleSize) / popSize;
    final double OVER_SAMPLING_MULT = 1.2;
    if (exactSize) {
        /*
         * Oversampling to enhance the chance of getting enough
         * samples (if we then have too many, we will drop some
         * later)
         */
        sampleProb = sampleProb * OVER_SAMPLING_MULT;
    }
    List<T> sample = new LinkedList<>(); // linked list for fast removal
    Iterator<T> iter = population.iterator();
    while (iter.hasNext()) {
        T element = iter.next();
        if (random.nextFloat() < sampleProb) {
            /*
             * Lucky Draw!
             */
            sample.add(element);
        }
    }
    int samplesTooMany = sample.size() - sampleSize;
    if (!exactSize || samplesTooMany == 0) {
        return sample;
    } else if (samplesTooMany > 0) {
        Set<Integer> indexesToRemoveAsSet = new HashSet<>();
        for (int i = 0; i < samplesTooMany; ) {
            int candidate = random.nextInt(sample.size());
            if (indexesToRemoveAsSet.add(candidate)) {
                /*
                 * add() returns true if candidate was not
                 * previously in the set
                 */
                i++; // proceed to draw the next index
            }
        }
        List<Integer> indexesToRemoveAsList
                = new ArrayList<>(indexesToRemoveAsSet);
        Collections.sort(indexesToRemoveAsList
                , (i1, i2) -> i2.intValue() - i1.intValue()); // descending order
        /*
         * Now we drop from the tail of the list
         */
        for (Integer index : indexesToRemoveAsList) {
            sample.remove((int) index); // remove by index (not by element)
        }
        return sample;
    } else {
        /*
         * We were unlucky: even with oversampling we still
         * got fewer samples than specified, so here we call
         * this very same method again recursively
         */
        return sampleFromPopulation(population, sampleSize, exactSize);
    }
}
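For the scenario in the question (picking 1000 random elements out of roughly 10,000,000, every few seconds), a call would look roughly like this; the variable names are made up for illustration:
Set<String> hugeSet = new TreeSet<>(); // your existing ~10,000,000-element set
// ... populate hugeSet ...
List<String> randomThousand = sampleFromPopulation(hugeSet, 1000, true);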
Upvotes: 1
Reputation: 88727
If your requirement is to select random elements out of a rather huge set then you should ask yourself whether a set is the best fit for that.
If you want to use the built-in sets there are several challenges you'd face.
TreeSet
A TreeSet is an ordered set and thus would allow you to access the n-th element. However, you'd have to iterate to position n, since there is no backing array that allows random access like an ArrayList has. As the name implies, the nodes in a TreeSet form a tree, and those nodes may be stored anywhere in memory. Because of this, to get the n-th element you'd have to start at the first node and hop from node to node until you reach position n - which is similar to how you'd do it in a LinkedList.
If all you want to do is select a random element, there are several options, for example calling tailSet(randomKey) and getting the first element of that tail set. Of course you'd have to handle random keys that are outside the elements' range. That way a lookup would basically be a binary search.
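A minimal sketch of that idea, assuming a TreeSet of Integer keys in a known range (for other element types you'd need some way to generate a random key in the key space; also note the pick is only uniform if the elements are spread fairly evenly over that range):
import java.util.NavigableSet;
import java.util.Random;
import java.util.SortedSet;
import java.util.TreeSet;

public class RandomFromTreeSet {

    private static final Random RANDOM = new Random();

    // Jump into the set with tailSet(randomKey) and take the first element of the tail.
    // This is an O(log n) lookup, like a binary search.
    static Integer pickRandom(NavigableSet<Integer> set, int minKey, int maxKey) {
        int randomKey = minKey + RANDOM.nextInt(maxKey - minKey + 1);
        SortedSet<Integer> tail = set.tailSet(randomKey);
        // Handle a random key beyond the largest element by wrapping around to the first one
        return tail.isEmpty() ? set.first() : tail.first();
    }

    public static void main(String[] args) {
        NavigableSet<Integer> set = new TreeSet<>();
        for (int i = 0; i < 100; i++) {
            set.add(RANDOM.nextInt(1000));
        }
        System.out.println(pickRandom(set, 0, 999));
    }
}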
HashSet
HashSets basically consist of 2 things: an array of buckets and, per bucket, a linked list or tree for collisions, i.e. for when 2 elements are mapped to the same bucket. Getting a random element might then be done by accessing a random bucket (this would be random access) and then iterating over the elements in that bucket for a random number of steps.
Upvotes: 1