Kundan Kumar

Reputation: 2002

Summary Statistics for String Types in Spark

Is there something in Spark like the summary function in R?

The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.

I am interested in getting results for string types as well, such as the four most frequently occurring strings (a group-by kind of operation), the number of unique values, etc.

Is there any pre-existing code for this?

If not, please suggest the best way to deal with string types.

Upvotes: 2

Views: 826

Answers (1)

Daniel Darabos

Reputation: 27456

I don't think there is such a thing for String in MLlib. But it would probably be a valuable contribution, if you are going to implement it.

Calculating just one of these metrics is easy. E.g. for top 4 by frequency:

def top4(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd
    .map(s => (s, 1))                       // pair each string with a count of 1
    .reduceByKey(_ + _)                     // sum the counts per distinct string
    .map { case (s, count) => (count, s) }  // swap so top() orders by count
    .top(4)                                 // take the 4 largest (count, string) pairs
    .map { case (count, s) => s }           // keep just the strings

Or number of uniques:

def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.distinct.count
  // For very large data, rdd.countApproxDistinct() is a cheaper approximation.

But doing this for all metrics in a single pass takes more work.
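One way to sketch such a single pass is an `aggregate` that builds a per-string frequency map, from which both metrics fall out. The helper names (`seqOp`, `combOp`) below are mine, and the functions are written as plain Scala so the logic is visible outside Spark; note the accumulator holds every distinct string, so this only suits moderate cardinality.

```scala
// seqOp: fold one element into a local accumulator (one per partition).
def seqOp(acc: Map[String, Long], s: String): Map[String, Long] =
  acc.updated(s, acc.getOrElse(s, 0L) + 1L)

// combOp: merge two per-partition accumulators.
def combOp(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
  b.foldLeft(a) { case (m, (s, c)) => m.updated(s, m.getOrElse(s, 0L) + c) }

// With Spark the single pass would be:
//   val freq = rdd.aggregate(Map.empty[String, Long])(seqOp, combOp)
// Both metrics then come from the one merged map:
def top4(freq: Map[String, Long]): Seq[String] =
  freq.toSeq.sortBy { case (s, c) => (-c, s) }.take(4).map(_._1)

def numUnique(freq: Map[String, Long]): Long = freq.size.toLong
```

For truly high-cardinality columns you would swap the exact map for a sketch (e.g. `countApproxDistinct` for uniques), but the aggregate shape stays the same.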


These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.

What I mean by splitting up the columns:

import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache // We will do N passes over this RDD.
  (0 until columns).map {
    i => together.mapValues(s => s(i)) // project out column i, keeping the row id
  }
}
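As a quick sanity check of the splitting idea, here is the same per-column projection on plain Scala maps standing in for the keyed RDDs (the data and the `splitLocal` name are just for illustration):

```scala
// Row-oriented data: row id -> all column values for that row.
val together: Map[Long, Seq[String]] = Map(
  0L -> Seq("alice", "red"),
  1L -> Seq("bob", "blue")
)

// One pass per column, exactly as the RDD version does with mapValues.
def splitLocal(together: Map[Long, Seq[String]],
               columns: Int): Seq[Map[Long, String]] =
  (0 until columns).map { i =>
    together.map { case (k, s) => (k, s(i)) }
  }
```

Keeping the row id as the key is what lets you join the per-column results back together later if needed.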

Upvotes: 1
