Zhang Tong

Reputation: 4719

What is the difference between DataFrame.select() and DataFrame.toDF() in Spark SQL?

It seems that they both return a new DataFrame.

Source code:

def toDF(self, *cols):
    jdf = self._jdf.toDF(self._jseq(cols))
    return DataFrame(jdf, self.sql_ctx)


def select(self, *cols):
    jdf = self._jdf.select(self._jcols(*cols))
    return DataFrame(jdf, self.sql_ctx)

Upvotes: 4

Views: 3124

Answers (1)

Fokko Driesprong

Reputation: 2250

The difference is subtle.

For example, you can convert an RDD of unnamed tuples such as ("Piter", 22) into a DataFrame with named columns using .toDF("name", "age"), and you can later rename all the columns by invoking toDF again:

scala> val rdd = sc.parallelize(List(("Piter", 22), ("Gurbe", 27)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2] at parallelize at <console>:27

scala> val df = rdd.toDF("name", "age")
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> df.show()
+-----+---+
| name|age|
+-----+---+
|Piter| 22|
|Gurbe| 27|
+-----+---+

scala> val df = rdd.toDF("person", "age")
df: org.apache.spark.sql.DataFrame = [person: string, age: int]

scala> df.show()
+------+---+
|person|age|
+------+---+
| Piter| 22|
| Gurbe| 27|
+------+---+

With select you can project the DataFrame onto a subset of its columns, for example to query or to save only the columns that you need:

scala> df.select("age").show()
+---+
|age|
+---+
| 22|
| 27|
+---+

scala> df.select("age").write.save("/tmp/ages.parquet")
Scaling row group sizes to 88.37% for 8 writers.
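One more difference worth noting: toDF can only rename columns, and it must be given a name for every column, whereas select accepts arbitrary column expressions, so you can rename, compute, or reorder individual columns in one call. A sketch, assuming the same df with columns (person, age) as above (the names renamed, who, years, and age_next_year are just illustrative):

```scala
// toDF requires a name for every column and only renames:
val renamed = df.toDF("who", "years")

// select accepts column expressions, not just names:
val projected = df.select(
  df("person").alias("who"),              // rename a single column
  (df("age") + 1).alias("age_next_year")  // compute a derived column
)
```

So toDF only changes the column names in the schema, while select builds a new projection over the data.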

Hope this helps!

Upvotes: 3
