zcourts
zcourts

Reputation: 5043

Efficiently find overlapping JodaTime Intervals

I have a Seq of Intervals and I want to collapse overlapping ones into each other. I've written this:

val today = DateTime.now()
val a: Seq[Interval] = Seq(
  new Interval(today.minusDays(11), today.minusDays(10)), //1 day interval ending 10 days ago
  new Interval(today.minusDays(10), today.minusDays(8)), //2 day interval ending 8 days ago, overlaps with above
  new Interval(today.minusDays(7), today.minusDays(5)), //2 day interval ending 5 days ago, DOES NOT OVERLAP
  new Interval(today.minusDays(4), today.minusDays(1)), //3 day interval ending 1 day ago, DOES NOT OVERLAP above
  new Interval(today.minusDays(4), today) //4 day interval ending today, overlaps with above
)
val actual = a.sortBy(_.getStartMillis).foldLeft(Seq[Interval]())((vals, e) => {
  if (vals.isEmpty) {
    vals ++ Seq(e)
  } else {
    val fst = vals.last
    val snd = e
    if (snd.getStart.getMillis <= fst.getEnd.getMillis) /*they overlap*/ {
      vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd)) //combine both into one interval
    } else {
      vals ++ Seq(e)
    }
  }
})
val expected = Seq(
  new Interval(today.minusDays(11), today.minusDays(8)),
  new Interval(today.minusDays(7), today.minusDays(5)),
  new Interval(today.minusDays(4), today)
)
println(
  s"""
     |Expected: $expected
     |Actual  : $actual
   """.stripMargin)
assert(expected == actual)

which works. My initial concern was with the line vals.dropRight(1) ++ Seq(new Interval(fst.getStart, snd.getEnd))

I suspect dropRight is O(n - m) where n = |vals| and m = 1 in this case.

This implementation becomes expensive if |a| is in the order of hundreds of thousands or more. In fact vals ++ Seq(e) is also an issue if it does n + 1 for each a[i].

Firstly, is my assessment correct?

Second, is there a more efficient way to write this without mutable data structures?

I've written this out of the context it's being used in, the actual application is in a Spark job (so that foldLeft would be folding on an RDD[MyType]).

EDIT : Completely forgot there's no foldLeft on an RDD (ignore Spark I'll have to think of another way but I'm still interested in an answer to this, minus the fact it won't work in Spark)

Upvotes: 2

Views: 616

Answers (1)

R&#252;diger Klaehn
R&#252;diger Klaehn

Reputation: 12565

Take a look at the IntervalSeq[A] and IntervalTrie[A] data structures in the spire math library. IntervalTrie[A] allows doing boolean operations such as union and intersection of sets of non-overlapping intervals with extremely high performance. It requires the element type to be losslessly convertable to a long, which is the case for joda DateTime.

Here is how you would solve this problem using spire:

First, make sure you have the right dependencies. Add a dependency to spire extras to your build.sbt:

libraryDependencies += "org.spire-math" %% "spire-extras" % "0.12.0"

Next, you need to define a IntervalTrie.Element typeclass instance for org.joda.time.DateTime:

implicit val jodaDateTimeIsIntervalTrieElement: IntervalTrie.Element[DateTime] = new IntervalTrie.Element[DateTime] {
  override implicit val order: Order[DateTime] = Order.by(_.getMillis)
  override def toLong(value: DateTime): Long = value.getMillis
  override def fromLong(key: Long): DateTime = new DateTime(key)
}

Now, you can use IntervalTrie to perform boolean operations on DateTime intervals (note that Interval here refers to the generic interval type spire.math.Interval, not joda Interval),

// import the Order[DateTime] instance (for spire.math.Interval creation)
import jodaDateTimeIsIntervalTrieElement.order

// create 100000 random Interval[DateTime]
val r = new scala.util.Random()
val n = 100000
val intervals = (0 until n).map { i =>
  val ticks = r.nextInt(1000000000) * 2000L
  val t0 = new DateTime(ticks)
  val t1 = new DateTime(ticks + r.nextInt(1000000000))
  val i = IntervalTrie(spire.math.Interval(t0, t1))
  i
}

//compute the union of all of them using IntervalTrie
val t0 = System.nanoTime()
val union = intervals.foldLeft(IntervalTrie.empty[DateTime])(_ | _)
val dt = (System.nanoTime() - t0) / 1.0e9
println(s"Union of $n random intervals took $dt seconds!")

This is very fast. On my machine:

Union of 100000 random intervals took 0.045001056 seconds!

Doing a proper benchmark with warmup will make this even faster.

Upvotes: 4

Related Questions