Reputation: 563
I'm trying to remove 140,000 objects from an ArrayList of size 7,140,000. I expected this would take seconds (if that), but instead Java is taking multiple seconds per thousand objects. Here is my code:
for (int i = list.size(); i > P; i--)
{
int size = list.size();
int index = (int) (Math.random() * size);
list.remove(index);
}
Note: P is a constant that I previously set to 7,000,000.
The goal of the loop is to randomly remove objects from list until its size is 7,000,000.
Is Java taking such a long time because I'm starting off with over 7 million objects? I've never noticed this efficiency problem with removing from ArrayLists in the past. If it helps, I use the DrJava Beta IDE.
Upvotes: 11
Views: 3868
Reputation: 140309
Every time you remove an element from an ArrayList, it has to shuffle all of the elements with greater indexes down by one slot. Say you remove the first element of a 7M-element list - you've then got to move 6,999,999 elements too.
If you're doing this in a loop, it will take O(n^2)
time, where n
is the size of the list. For a 7M-element list, that's going to be pretty slow.
Instead, if you know which elements you want to remove in advance, you can shift all the elements down in a single pass:
int dst = 0;
for (int src = 0; src < list.size(); ++src) {
if (!toRemove(src)) {
list.set(dst++, list.get(src));
}
}
list.subList(dst, list.size()).clear();
where toRemove(src)
is some function which says whether you want to remove the src
-th element.
For example, you might construct a BitSet
with all but P
elements set:
BitSet toRemove = new BitSet(list.size());
for (int i = list.size(); i > P; i--) {
int rand;
do {
rand = Math.random() * list.size();
} while (toRemove.get(rand));
toRemove.set(rand, true);
}
You still have to shift all of the 6,999,999 elements to the right if you just remove the zero-element from a 7M element list; but any other removals don't require any more shifts on top. This algorithm is O(n)
, where n is the size of the list.
Edit: you can pick P
elements from the list (where P <= list.size()
) like this:
int dst = 0;
Random rand = new Random();
for (int src = 0; dst < P; ++src) {
if (rand.nextInt(list.size() - src) < (P-dst)) {
list.set(dst++, list.get(src));
}
}
list.subList(dst, list.size()).clear();
This strategy will select elements from the list with equal probability (*), and works well for any value of P
; it also preserves the original order.
If you want to sample K
items from a list with N
items without drawing the same element twice, there are choose(N, K) = N! / (K! * (N-K)!)
ways to do this. If you want to pick all elements from the list with equal probability, then you should pick any of these c(n,k)
different configurations.
When there are k
items left to pick from n
items, you will either:
k-1
items from the remaining n-1
items; ork
items from the remaining n-1
items.In order to ensure the equal probability of picking the K
elements overall, you need to choose one of the two options according to the number of combinations for picking from the n-1
elements:
#(combinations after taking first item)
P(take first item) = ------------------------------------------------------------------
#(combinations after taking) + #(combinations after not taking)
= C(n-1,k-1) / (C(n-1, k-1) + C(n-1, k))
= ... working omitted ...
= k / n
So, when you've got k
items left to take from n
, you should take the first item k/n
of the time.
The two interesting cases to point out are:
k == n
, k/n = 1
, so you always take the element. Intuitively, if you've got to pick n
items out of n
, you've got to take them all.k == 0
, k/n = 0
, so you never take the element. Intuitively, if you've already picked all K
of your items, you don't need to take any more.To implement this, you can simply generate a uniformly-distributed random number r
in the range [0..n)
, and "take" the element from the list if r < k
.
In terms of the implementation above, k = P - dst
, and n = list.size() - src
.
Upvotes: 7
Reputation: 116332
An ArrayList is backed by an array, so modifications need to really move items aside, and in some cases even create a whole new array.
Some possible solutions:
Consider using LinkedList or skip-list implementation instead. Do note that here, to remove an item it still takes O(N) (or O(logN) in skip-list), because it has to find it. However, you can traverse the items with a chance based on how many items you've removed.
You could randomly take items from the input into a new ArrayList till you get the number of items you wish. You have to know which items you added though, so traverse in a linear way, and have the random chooser to have a chance of how many steps to go, based on how many items you've moved.
Easiest solution: shuffle the entire input array, and then choose the first M items.
Here's a possible code for solution #3:
public static List<String> pickNRandom(List<String> lst, int m) {
Collections.shuffle(lst);
return lst.subList(0, n);
}
The disadvantage here is that it ruins the order of the items. You can overcome this by creating a copy of the list as the input, but this will take more memory (temporarily) ...
Upvotes: 6