DLYT

Reputation: 1

Apache Spark Scala - data analysis - error

I am new to / still learning Apache Spark and Scala. I am trying to analyze a dataset that I have loaded into Spark. However, when I try to perform a basic analysis such as max, min, or average, I get an error:

error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]

Could anyone please shed some light on this? I am running Spark on an organization's cloudlab.

Code:

// Reading in the csv file

val df = sc.textFile("/user/Spark/PortbankRTD.csv").map(x => x.split(","))  

// Select Max of Age

df.select(max($"age")).show()                                                                                                        

Error:

<console>:40: error: value select is not a member of org.apache.spark.rdd.RDD[Array[String]]                                                
          df.select(max($"age")).show()  

Please let me know if you need any more information. Thanks.

Upvotes: 0

Views: 247

Answers (1)

user4601931

Reputation: 5304

Following up on my comment: textFile returns an RDD[String], and your map then turns it into an RDD[Array[String]]. select is a method on DataFrame, not on RDD, so you need to convert your RDD into a DataFrame first. You can do this in a number of ways. One example is

import spark.implicits._  // provides the toDF() conversion and the $"col" column syntax

// Read the raw lines and convert them into a DataFrame
val rdd = sc.textFile("/user/Spark/PortbankRTD.csv")
val df = rdd.toDF()
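
This gives a DataFrame with a single string column (named value by default), so the individual fields are still not queryable by name. A quick check:

df.printSchema()
// root
//  |-- value: string (nullable = true)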

There are also built-in readers for many types of input files:

spark.read.csv("/user/Spark/PortbankRTD.csv")

returns a DataFrame immediately.
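
For the max/min/average in the question, here is a minimal sketch, assuming PortbankRTD.csv has a header row with a numeric age column (adjust the options if it does not):

import spark.implicits._
import org.apache.spark.sql.functions.{avg, max, min}

// header      -> use the first row as column names
// inferSchema -> detect numeric types such as age
val bank = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/user/Spark/PortbankRTD.csv")

bank.select(max($"age"), min($"age"), avg($"age")).show()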

Upvotes: 3
