Reputation: 2002
Is there something like the summary function in Spark, like the one in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting results for string types as well, such as the four most frequently occurring strings (a groupBy kind of operation), the number of unique values, etc.
Is there any preexisting code for this?
If not, please suggest the best way to deal with string types.
Upvotes: 2
Views: 826
Reputation: 27456
I don't think there is such a thing for String in MLlib, but it would probably be a valuable contribution if you decide to implement it.
Calculating just one of these metrics is easy. E.g. for top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd
    .map(s => (s, 1))                       // Pair each string with a count of 1.
    .reduceByKey(_ + _)                     // Sum the counts per distinct string.
    .map { case (s, count) => (count, s) }  // Swap so top() orders by count.
    .top(4)                                 // Take the 4 highest-count pairs.
    .map { case (count, s) => s }           // Keep just the strings.
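For example, a minimal usage sketch (assuming a SparkContext named sc is already in scope):

val words = sc.parallelize(Seq("a", "a", "a", "b", "b", "c"))
top4(words) // Array(a, b, c): most frequent first, at most 4 entries.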
Or the number of uniques:

def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.distinct.count
But doing this for all metrics in a single pass takes more work.
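One way to keep it cheap is to compute the per-string counts once, cache that RDD, and derive the frequency-based metrics from it. Here is a sketch of that idea (stringSummary is a hypothetical name, not an MLlib API); the two metrics still run as separate jobs, but the expensive shuffle happens only once:

import org.apache.spark.rdd.RDD

// Sketch: share one shuffle across several string metrics.
def stringSummary(rdd: RDD[String]): (Array[String], Long) = {
  val counts = rdd.map(s => (s, 1L)).reduceByKey(_ + _)
  counts.cache() // Reused by both computations below.
  val top = counts
    .map { case (s, count) => (count, s) }
    .top(4)
    .map { case (_, s) => s }
  val unique = counts.count() // One entry per distinct string.
  (top, unique)
}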
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
What I mean by splitting up the columns:
import org.apache.spark.rdd.RDD

def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache // We will do N passes over this RDD, one per column.
  (0 until columns).map {
    i => together.mapValues(s => s(i)) // Project out column i, keeping the row ID.
  }
}
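And a quick sketch tying the two together (the two-column dataset and sc are assumptions here):

// Hypothetical two-column dataset keyed by row ID.
val rows: org.apache.spark.rdd.RDD[(Long, Seq[String])] =
  sc.parallelize(Seq((0L, Seq("x", "a")), (1L, Seq("y", "a"))))
val byColumn = split(rows, columns = 2)
top4(byColumn(1).values) // Top strings in the second column.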
Upvotes: 1