peter.petrov

Reputation: 39437

Spark SQL - DataFrame - select - transformation or action?

In Spark SQL (working with the Java APIs) I have a DataFrame.

The DataFrame has a select method. Is it a transformation or an action?

I just need confirmation and a good reference that states this clearly.

Upvotes: 8

Views: 10180

Answers (3)

adarsh

Reputation: 161

select is a transformation, not an action.

Refer to the Spark documentation.

For more information and explanation, read this.

Upvotes: 0

Samuel William

Reputation: 11

If you execute the code below, you will see the output in the console:

import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master", "local")
        .getOrCreate()

    val range = sparksession.range(1, 500).toDF("numbers")
    range.select(range.col("numbers"), range.col("numbers") + 10).show(2)
}

+-------+--------------+
|numbers|(numbers + 10)|
+-------+--------------+
|      1|            11|
|      2|            12|
+-------+--------------+
If you execute the following code, with only select and no show, you will not see any output even though the code runs. This means select is just a transformation, not an action: it is not evaluated until an action is called.

object learnSpark2 extends App {
    val sparksession = SparkSession.builder()
        .appName("Learn Spark")
        .config("spark.master","local")
        .getOrCreate()

    val range = sparksession.range(1, 500).toDF("numbers")
    range.select(range.col("numbers"), range.col("numbers") + 10)
}

In the console:

19/01/03 22:46:25 INFO Utils: Successfully started service 'sparkDriver' on port 55531.
19/01/03 22:46:25 INFO SparkEnv: Registering MapOutputTracker
19/01/03 22:46:25 INFO SparkEnv: Registering BlockManagerMaster
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/03 22:46:25 INFO DiskBlockManager: Created local directory at C:\Users\swilliam\AppData\Local\Temp\blockmgr-9abc8a2c-15ee-4e4f-be04-9ef37ace1b7c
19/01/03 22:46:25 INFO MemoryStore: MemoryStore started with capacity 1992.9 MB
19/01/03 22:46:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/03 22:46:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/01/03 22:46:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.192.99.214:4040
19/01/03 22:46:26 INFO Executor: Starting executor ID driver on host localhost
19/01/03 22:46:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55540.
19/01/03 22:46:26 INFO NettyBlockTransferService: Server created on 10.192.99.214:55540
19/01/03 22:46:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/03 22:46:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.99.214:55540 with 1992.9 MB RAM, BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/UDEMY/SparkJob/spark-warehouse/').
19/01/03 22:46:26 INFO SharedState: Warehouse path is 'file:/C:/UDEMY/SparkJob/spark-warehouse/'.
19/01/03 22:46:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/01/03 22:46:29 INFO SparkContext: Invoking stop() from shutdown hook
19/01/03 22:46:29 INFO SparkUI: Stopped Spark web UI at http://10.192.99.214:4040
19/01/03 22:46:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/03 22:46:29 INFO MemoryStore: MemoryStore cleared
19/01/03 22:46:29 INFO BlockManager: BlockManager stopped
19/01/03 22:46:29 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/03 22:46:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/03 22:46:29 INFO SparkContext: Successfully stopped SparkContext
19/01/03 22:46:29 INFO ShutdownHookManager: Shutdown hook called
19/01/03 22:46:29 INFO ShutdownHookManager: Deleting directory C:\Users\swilliam\AppData\Local\Temp\spark-c69bfb9b-f351-45af-9947-77950b23dd15
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore="C:\Program Files\SquirrelSQL\certificates\jssecacerts"

Upvotes: 1

Nikhil

Reputation: 1236

It is a transformation. Please refer to: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.
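The laziness that separates transformations from actions can be illustrated even without Spark, using plain Scala's LazyList (a rough analogy only, not Spark itself): map, like select, merely records the operation, while forcing the result, like show or count, triggers the actual computation. The object name and counter below are made up for the demo.

```scala
// Rough analogy to Spark's transformation/action split.
// (Plain Scala LazyList, not Spark -- it just illustrates lazy evaluation.)
object LazyDemo extends App {
  var evaluated = 0

  // "Transformation": map records the work but runs nothing yet.
  val transformed = LazyList.from(1).take(3).map { n =>
    evaluated += 1
    n + 10
  }
  println(s"after map: $evaluated elements evaluated")    // 0 -- nothing ran

  // "Action": forcing the list (like show/count in Spark) runs the computation.
  val result = transformed.toList
  println(s"after toList: $evaluated elements evaluated") // 3
  println(result)                                          // List(11, 12, 13)
}
```

The same principle is why Samuel William's second snippet prints nothing for the select: the plan is built, but no action ever asks for its results.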

Upvotes: 8
