I know I can extract columns like this:
userData1.select(userData1("job"))
But what if I already have a column, or an array of columns? How do I get a DataFrame out of it? What has worked for me so far is:
userData1.select(userData1("id"), userData1("age"))
This is a bit verbose and ugly compared to what you can do in R:
userData1[, c("id", "age")]
And what about rows? For example:
userData1.head(5)
How do you convert this into a new dataframe?
To select multiple columns you can use the varargs syntax:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import sqlContext.implicits._  // for $"..." and toDF

val df: DataFrame = sc.parallelize(Seq(
  (1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)
)).toDF("foo", "bar", "baz")

val fooAndBaz: DataFrame = df.select(Seq($"foo", $"baz"): _*)
// or, building the columns from their names:
// df.select(Seq("foo", "baz").map(col): _*)
fooAndBaz.show
// +---+---+
// |foo|baz|
// +---+---+
// | 1|2.0|
// | 2|3.0|
// | 3|4.0|
// +---+---+
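If you are selecting by name anyway, select also has a string varargs overload, so you can skip building Column objects entirely (a small sketch reusing the df defined above; the val name is just for illustration):
val fooAndBazByName: DataFrame = df.select("foo", "baz")
fooAndBazByName.show
// +---+---+
// |foo|baz|
// +---+---+
// | 1|2.0|
// | 2|3.0|
// | 3|4.0|
// +---+---+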
The PySpark equivalent is argument unpacking:
df = sqlContext.createDataFrame(
    [(1, "x", 2.0), (2, "y", 3.0), (3, "z", 4.0)],
    ("foo", "bar", "baz"))

df.select(*["foo", "baz"]).show()
## +---+---+
## |foo|baz|
## +---+---+
## | 1|2.0|
## | 2|3.0|
## | 3|4.0|
## +---+---+
To limit the number of rows without collecting you can use the limit method:
val firstTwo: DataFrame = df.limit(2)
firstTwo.show
// +---+---+---+
// |foo|bar|baz|
// +---+---+---+
// | 1| x|2.0|
// | 2| y|3.0|
// +---+---+---+
which is equivalent to the SQL LIMIT clause:
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df LIMIT 2").show
// +---+---+---+
// |foo|bar|baz|
// +---+---+---+
// | 1| x|2.0|
// | 2| y|3.0|
// +---+---+---+
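To get back to the original question: since select and limit each return a DataFrame, the two compose directly, so a column-and-row subset is just one chained on the other (a minimal sketch on the same df):
val firstTwoFooAndBaz: DataFrame = df.select($"foo", $"baz").limit(2)
firstTwoFooAndBaz.show
// +---+---+
// |foo|baz|
// +---+---+
// | 1|2.0|
// | 2|3.0|
// +---+---+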