qed

Reputation: 23134

Convert Rows or Columns to a DataFrame

I know I can extract columns like this:

  userData1.select(userData1("job"))

But what if I already have a column, or an array of columns? How do I get a dataframe out of it? What has worked for me so far is:

  userData1.select(userData1("id"), userData1("age"))

This is a bit verbose and ugly compared to what you can do in R:

  userData1[, c("id", "age")]

And what about rows? For example:

  userData1.head(5)

How do you convert this into a new dataframe?

Upvotes: 3

Views: 538

Answers (1)

zero323

Reputation: 330363

To select multiple columns you can use the varargs syntax:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col  // needed for the map col variant below
import sqlContext.implicits._              // toDF and the $ column syntax (pre-imported in spark-shell)

val df: DataFrame = sc.parallelize(Seq(
  (1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)
)).toDF("foo", "bar", "baz")

// or df.select(Seq("foo", "baz") map col: _*)
val fooAndBaz: DataFrame = df.select(Seq($"foo", $"baz"): _*)
fooAndBaz.show

// +---+---+
// |foo|baz|
// +---+---+
// |  1|2.0|
// |  2|3.0|
// |  3|4.0|
// +---+---+
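
This also covers the "array of columns" case from the question: the same `: _*` expansion works on an array. A minimal sketch (the `cols` value here is just for illustration):

val cols: Array[org.apache.spark.sql.Column] = Array($"foo", $"baz")
df.select(cols: _*).show  // same output as above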

The PySpark equivalent is argument unpacking:

df = sqlContext.createDataFrame(
    [(1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)],
    ("foo", "bar", "baz"))

df.select(*["foo", "baz"]).show()

## +---+---+
## |foo|baz|
## +---+---+
## |  1|2.0|
## |  2|3.0|
## |  3|4.0|
## +---+---+

To limit the number of rows without collecting, you can use the limit method:

val firstTwo: DataFrame = df.limit(2)
firstTwo.show

//  +---+---+---+
//  |foo|bar|baz|
//  +---+---+---+
//  |  1|  x|2.0|
//  |  2|  y|3.0|
//  +---+---+---+

which is equivalent to the SQL LIMIT clause:

df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df LIMIT 2").show

//  +---+---+---+
//  |foo|bar|baz|
//  +---+---+---+
//  |  1|  x|2.0|
//  |  2|  y|3.0|
//  +---+---+---+
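
If you have already collected rows with head, one way to turn the resulting Array[Row] back into a DataFrame is to parallelize it and reuse the original schema. A sketch (not the preferred route: unlike limit, the data makes a round trip through the driver):

val rows = df.head(2)  // Array[Row], collected to the driver
val fromRows: DataFrame = sqlContext.createDataFrame(sc.parallelize(rows), df.schema)
fromRows.show  // same output as limit(2) above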

Upvotes: 1
