Reputation: 3782
The following piece of code takes a lot of time to run on 4 GB of raw data in a cluster:
df.select("type", "user_pk", "item_pk","timestamp")
.withColumn("date",to_date(from_unixtime($"timestamp")))
.filter($"date" > "2018-04-14")
.select("type", "user_pk", "item_pk")
.map {
row => {
val typef = row.get(0).toString
val user = row.get(1).toString
val item = row.get(2).toString
(typef, user, item)
}
}
The output should be of type Dataset[(String,String,String)].
I guess that the map part takes a lot of time. Is there any way to optimize this piece of code?
Upvotes: 0
Views: 46
Reputation: 6385
You're creating a date column of Date type and then comparing it with a string? I'd assume some implicit conversion is happening underneath (for each row while filtering).
Instead, I'd convert that date string to a timestamp and do a numeric comparison (since you're using from_unixtime, I assume the timestamp is stored as System.currentTimeMillis or similar):
val timestamp = some_to_timestamp_func("2018-04-14")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > timestamp)
  // ... etc
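A minimal sketch of that idea, assuming the timestamp column holds epoch seconds (what from_unixtime expects) and a UTC session time zone; the cutoff computation below is just one way to fill in the some_to_timestamp_func placeholder:

import java.time.{LocalDate, ZoneOffset}
import df.sparkSession.implicits._

// Start of the day after 2018-04-14, so the strict ">" on dates in the
// original filter is preserved when comparing raw epoch seconds.
val cutoff = LocalDate.parse("2018-04-14")
  .plusDays(1)
  .atStartOfDay(ZoneOffset.UTC)
  .toEpochSecond

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" >= cutoff) // no per-row date conversion
  .select("type", "user_pk", "item_pk")

This keeps the predicate on the raw numeric column, so Spark doesn't have to build a date value for every row before filtering.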
Upvotes: 1
Reputation: 35219
I seriously doubt the map is the problem; nonetheless, I wouldn't use it at all and would go with the standard Dataset converter:
import df.sparkSession.implicits._

df.select("type", "user_pk", "item_pk", "timestamp")
  .withColumn("date", to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String,String,String)]
Upvotes: 1