Reputation: 39437
In Spark SQL (working with the Java APIs) I have a DataFrame. The DataFrame has a select method. I wonder: is it a transformation or an action?
I just need a confirmation and a good reference that states it clearly.
Upvotes: 8
Views: 10180
Reputation: 161
select is a transformation function.
For more information and an explanation, read this.
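A minimal sketch of why this matters, using the Java API from the question (assuming Spark is on the classpath; the class and method names here are made up for illustration): select returns a new Dataset immediately and submits no job, while an action like collectAsList is what actually triggers execution.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SelectIsLazy {
    // Doubles every value in the "n" column and sums the result.
    static long doubledSum(SparkSession spark) {
        Dataset<Row> df = spark.range(1, 5).toDF("n");
        // Transformation: returns a new Dataset right away; no job runs here.
        Dataset<Row> projected = df.select(df.col("n").multiply(2).alias("doubled"));
        // Action: only at collectAsList() does Spark execute the plan.
        return projected.collectAsList().stream()
                .mapToLong(row -> row.getLong(0))
                .sum();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Select Is Lazy")
                .master("local[*]")
                .getOrCreate();
        System.out.println(doubledSum(spark)); // 2 + 4 + 6 + 8 = 20
        spark.stop();
    }
}
```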
Upvotes: 0
Reputation: 11
If you execute the code below, you will see the output in the console:
import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
  val sparksession = SparkSession.builder()
    .appName("Learn Spark")
    .config("spark.master", "local")
    .getOrCreate()

  val range = sparksession.range(1, 500).toDF("numbers")
  range.select(range.col("numbers"), range.col("numbers") + 10).show(2)
}
+-------+--------------+
|numbers|(numbers + 10)|
+-------+--------------+
|      1|            11|
|      2|            12|
+-------+--------------+
only showing top 2 rows
If you execute the following code, with only the select and no show, you will not see any output even though the code runs. This means select is just a transformation, not an action, so it is not evaluated.
import org.apache.spark.sql.SparkSession

object learnSpark2 extends App {
  val sparksession = SparkSession.builder()
    .appName("Learn Spark")
    .config("spark.master", "local")
    .getOrCreate()

  val range = sparksession.range(1, 500).toDF("numbers")
  range.select(range.col("numbers"), range.col("numbers") + 10)
}
In the console:
19/01/03 22:46:25 INFO Utils: Successfully started service 'sparkDriver' on port 55531.
19/01/03 22:46:25 INFO SparkEnv: Registering MapOutputTracker
19/01/03 22:46:25 INFO SparkEnv: Registering BlockManagerMaster
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: Using
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/01/03 22:46:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/01/03 22:46:25 INFO DiskBlockManager: Created local directory at
C:\Users\swilliam\AppData\Local\Temp\blockmgr-9abc8a2c-15ee-4e4f-be04-9ef37ace1b7c
19/01/03 22:46:25 INFO MemoryStore: MemoryStore started with capacity 1992.9 MB
19/01/03 22:46:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/01/03 22:46:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/01/03 22:46:26 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at
http://10.192.99.214:4040
19/01/03 22:46:26 INFO Executor: Starting executor ID driver on host localhost
19/01/03 22:46:26 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 55540.
19/01/03 22:46:26 INFO NettyBlockTransferService: Server created on 10.192.99.214:55540
19/01/03 22:46:26 INFO BlockManager: Using
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/01/03 22:46:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMasterEndpoint: Registering block manager 10.192.99.214:55540 with 1992.9 MB RAM, BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.192.99.214, 55540, None)
19/01/03 22:46:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/C:/UDEMY/SparkJob/spark-warehouse/').
19/01/03 22:46:26 INFO SharedState: Warehouse path is 'file:/C:/UDEMY/SparkJob/spark-warehouse/'.
19/01/03 22:46:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
19/01/03 22:46:29 INFO SparkContext: Invoking stop() from shutdown hook
19/01/03 22:46:29 INFO SparkUI: Stopped Spark web UI at http://10.192.99.214:4040
19/01/03 22:46:29 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/01/03 22:46:29 INFO MemoryStore: MemoryStore cleared
19/01/03 22:46:29 INFO BlockManager: BlockManager stopped
19/01/03 22:46:29 INFO BlockManagerMaster: BlockManagerMaster stopped
19/01/03 22:46:29 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/01/03 22:46:29 INFO SparkContext: Successfully stopped SparkContext
19/01/03 22:46:29 INFO ShutdownHookManager: Shutdown hook called
19/01/03 22:46:29 INFO ShutdownHookManager: Deleting directory C:\Users\swilliam\AppData\Local\Temp\spark-c69bfb9b-f351-45af-9947-77950b23dd15
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStore="C:\Program Files\SquirrelSQL\certificates\jssecacerts"
Upvotes: 1
Reputation: 1236
It is a transformation. Please refer to: https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions count, show, or writing data out to file systems.
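One way to see "actions are the ones that trigger computation" concretely (a sketch only, assuming Spark on the classpath; the class name and the UDF name "inv" are made up): a select whose expression would fail at runtime still succeeds, because nothing is evaluated until an action runs.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;

public class LazySelectDemo {
    // Returns true when select succeeds but the action fails,
    // i.e. evaluation is deferred until an action runs.
    static boolean failsOnlyAtAction(SparkSession spark) {
        // A UDF that throws on zero input (100 / 0 is an ArithmeticException).
        spark.udf().register("inv", (UDF1<Long, Long>) x -> 100 / x, DataTypes.LongType);

        Dataset<Row> df = spark.range(0, 5).toDF("n"); // includes n = 0

        // Transformation: this succeeds even though evaluating the UDF
        // on the row n = 0 would fail -- nothing has been computed yet.
        Dataset<Row> inverted = df.select(callUDF("inv", df.col("n")));

        try {
            inverted.collectAsList(); // action: the UDF actually runs here...
            return false;             // ...so reaching this line would mean eager evaluation
        } catch (Exception e) {
            return true;              // the failure surfaced only at the action
        }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Lazy Select Demo")
                .master("local[*]")
                .getOrCreate();
        System.out.println(failsOnlyAtAction(spark)); // true: select alone never fails
        spark.stop();
    }
}
```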
Upvotes: 8