Chondrops
Chondrops

Reputation: 758

Group a dataset by a derived column value

I want to group the following dataset by a derived value of the timestamp column, namely the year, so a predictable substring of the timestamp column.

doi                         timestamp
10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00
10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00
10.1007/s13595-016-0548-3   2015-06-08T17:01:10.000010+01:00

I realise that I can add my own derived column and filter based on this, but is there a way to specify it in a single groupBy statement, without adding an additional column purely for grouping purposes?

Upvotes: 1

Views: 185

Answers (1)

eliasah
eliasah

Reputation: 40380

If I understand your question correctly, you'll need to extract the year inside the group by clause :

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{unix_timestamp, year}

val sc: SparkContext = ??? // I consider that you are able to create both your SparkContext and SQLContext alone
val sqlContext: SQLContext = ???

import sqlContext.implicits._ // needed to use implicits like .toDF

val data = Seq(
  "10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00",
  "10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00",
  "10.1007/s13595-016-0548-3   2015-06-08T17:01:10.000010+01:00")
// data: Seq[String] = List(10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00, 10.1515/cipms-2015-0089     2016-06-09T18:29:46.000046+01:00, 10.1007/s13595-016-0548-3   2015-06-08T17:01:10.000010+01:00)

val df = sc.parallelize(data).map(_.split("\\s+") match {
  case Array(doi, time) => (doi, time)
}).toDF("doi", "timestamp").withColumn("timestamp", unix_timestamp($"timestamp", "yyyy-MM-dd'T'hh:mm:ss").cast("timestamp"))
// df: org.apache.spark.sql.DataFrame = [doi: string, timestamp: timestamp]

df.groupBy(year($"timestamp").as("year")).count.show
// +----+-----+
// |year|count|
// +----+-----+
// |2015|    1|
// |2016|    2|
// +----+-----+

Upvotes: 2

Related Questions