rabinnh

Reputation: 178

Spark 2.0.2 doesn't seem to think that "groupBy" is returning a DataFrame

This feels sort of silly, but I am migrating from Spark 1.6.1 to Spark 2.0.2. I was using the Databricks CSV library and am now attempting to use the built-in CSV DataFrameWriter.

The following code:

    // Imports needed by the snippet below (lit and sum come from sql.functions)
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.{lit, sum}

    // Get an SQLContext
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val sTS = lTimestampToSummarize.toString
    val sS3InputPath = "s3://measurements/" + sTS + "/*"

    // Read all measurements - note that all subsequent ETLs will reuse dfRaw
    val dfRaw = sqlContext.read.json(sS3InputPath)

    // Filter just the user/segment timespent records
    val dfSegments = dfRaw.filter("segment_ts <> 0").withColumn("views", lit(1))

    // Aggregate views and timespent per user/segment tuples
    val dfUserSegments: DataFrame = dfSegments
        .groupBy("company_id", "division_id", "department_id", "course_id", "user_id", "segment_id")
        .agg(sum("segment_ts").alias("segment_ts_sum"),
             sum("segment_est").alias("segment_est_sum"),
             sum("views").alias("segment_views"))

    // The following will write CSV files to the S3 bucket
    val sS3Output = "s3://output/" + sTS + "/usersegment/"
    dfUserSegments.write.csv(sS3Output)

Returns this error:

[error] /home/Spark/src/main/scala/Example.scala:75: type mismatch;
[error]  found   : Unit
[error]  required: org.apache.spark.sql.DataFrame
[error]     (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
[error]         dfUserSegments.write.csv(sS3Output)
[error]                                 ^
[error] one error found
[error] (compile:compile) Compilation failed
[error] Total time: 2 s, completed Jun 5, 2017 5:00:12 PM

I know I must be interpreting the error wrong, because I explicitly declared dfUserSegments as a DataFrame, and yet the compiler is telling me it is of type Unit.

Any help is appreciated.

Upvotes: 1

Views: 258

Answers (1)

zsxwing

Reputation: 20826

You don't show the whole method, but my guess is that the method's declared return type is DataFrame, while the last statement in the method is dfUserSegments.write.csv(sS3Output), and csv's return type is Unit.
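
If that's the case, here is a minimal sketch of the fix, assuming a hypothetical enclosing method (the method name and parameter list below are made up; only the last two lines matter): do the write as a side effect, then return the DataFrame explicitly so the method's last expression has type DataFrame.

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.{lit, sum}

    // Hypothetical method signature; the point is the declared DataFrame return type
    def summarizeUserSegments(sqlContext: SQLContext,
                              sS3InputPath: String,
                              sS3Output: String): DataFrame = {
        val dfRaw = sqlContext.read.json(sS3InputPath)
        val dfSegments = dfRaw.filter("segment_ts <> 0").withColumn("views", lit(1))
        val dfUserSegments = dfSegments
            .groupBy("company_id", "division_id", "department_id",
                     "course_id", "user_id", "segment_id")
            .agg(sum("segment_ts").alias("segment_ts_sum"),
                 sum("segment_est").alias("segment_est_sum"),
                 sum("views").alias("segment_views"))

        // write.csv returns Unit, so it must not be the method's last expression
        dfUserSegments.write.csv(sS3Output)

        // return the DataFrame explicitly to satisfy the declared return type
        dfUserSegments
    }

If the caller doesn't actually need the DataFrame back, the other fix is simply to change the method's return type to Unit.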

Upvotes: 6
