Reputation: 1623
I want to sample from a Scala array, the sample size can be much larger than the length of the array. How can I do this efficiently? By using the following code the running time is linear to the sample size, when the sample size is very big it is slow if we need to do the sampling many times:
def getSample(dataArray: Array[Double], sampleSize: Int, seed: Int): Array[Double] =
{
val arrLength = dataArray.length
val r = new scala.util.Random(seed)
Array.fill(sampleSize)(dataArray(r.nextInt(arrLength)))
}
val myArr= Array(1.0,5.0,9.0,4.0,7.0)
getSample(myArr, 100000, 28)
Upvotes: 1
Views: 767
Reputation: 1143
it is easy with a list. Use the following implicit function
object ListImplicits {
implicit class SampledArray[T](in: List[T]) {
def sample(n: Int, seed:Option[Long]=None): List[T] = {
seed match {
case Some(s) => Random.setSeed(s)
case _ => // nothing
}
Random.shuffle(in).take(n)
}
}
}
And then import the object and use collection conversions to switch from Array to list (slight overhead):
import ListImplicits.SampledArray
val n = 100000
val list = (0 to n).toList.map(i => Random.nextInt())
val array = list.toArray
val t0 = System.currentTimeMillis()
array.toList.sample(5).toArray
val t1 = System.currentTimeMillis()
list.sample(5)
val t2 = System.currentTimeMillis()
println( "Array (conversion) => delta = " + (t1-t0) + " ms") // 10 ms
println( "List => delta = " + (t2-t1) + " ms") // 8 ms
Upvotes: 0
Reputation: 1918
The probability that any given element of an array of length $n$ appears at least once in a sample of size $k$ is $1-(1-1/n)^k$. If this value is close to 1, which occurs when $k$ is large compared to $n$, then the following algorithm might be a good choice depending on your needs:
import org.apache.commons.math3.random.MersennseTwister
import org.apache.commons.math3.distribution.BinomialDistribution
def getSampleCounts[T](data: Array[T], k: Int, seed: Long): Array[Int] = {
val rng = new MersenneTwister(seed)
val counts = new Array[Int](data.length)
var i = k
do {
val j = new BinomialDistribution(rng.nextLong(), i, 1.0/i)
counts(i) = j
i -= j
} while (i > 0)
counts
}
Note that this algorithm does not return a sample. Instead it returns an Array[Int]
whose $i$-th entry is equal to the number of times data(i)
appears in the random sample. This may not be suitable for all applications, but for some use cases having the sample in the form of some sort of Iterable
over (value, count) pairs (which can be obtained by data.view.zip(getSampleCounts(data, k, seed))
, for example) is actually very convenient since it often enables us to do a computation once for groups of samples (since they are equal.) For example, suppose I had an expensive function f: T => Double
and I wanted to compute the sample mean of f
applied to a random sample of size $k$ draw from data
. Then we could do the following:
data.view.zip(getSampleCounts(data, k, seed)).map({case (x, count) => f(x)*count}).sum/k
This computation for the sample mean evaluates f
$n$ instead of $k$ times (recall that we are assuming that $k$ is large compared to $n$.)
Note that getSampleCounts
will loop at most $n$ times where $n$ is data.length
. Also, sampling from the binomial distribution in each iteration, assuming this is done in a reasonable fashion in the apache.commons.math3 library, should have complexity no worse than $O(\log k)$ (inverse CDF method and binary search.) So the complexity of the above algorithm is $O(n \log k)$ where $n$ is data.length
and $k$ is the number of samples you want to draw.
Upvotes: 1
Reputation: 173
To make this sampling program run faster, use Akka actor framework to run the sampling jobs in parallel. Create a master actor for distributing the sampling works to Worker actors and also to concatenate the elements from different workers. So each Worker actor would prepare/collect a fixed number of sample elements and give back the resulting collection as an immutable array to the master. Upon receiving the 'WorkDone' user-defined message from Worker, the Master actor concatenates the elements into the final collection.
Upvotes: 0
Reputation: 16412
There is no way around it. If you need to take N elements with constant time element access the complexity will be O(n)
(linear) no matter what.
You can deffer/amortize the cost by making it lazy. For instance you can return a Stream
or Iterator
that evaluates each element as you access it. This will help you save on memory usage if you can fold that stream as you are consuming it. In other words you can skip the copy part and work directly with initial array - not always possible, depends on the task.
Upvotes: 0