Reputation: 2652
In spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame
with columns named "sex"
and "avg_age"
.
How can avg(age)
be aliased to "avg_age"
without using textual SQL?
Edit: After zero323 's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
Upvotes: 27
Views: 93823
Reputation: 330063
If you prefer to rename a single column it is possible to use withColumnRenamed
method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively you can use alias
method:
import org.apache.spark.sql.functions.avg
df.select(avg($"age").alias("average_age"))
You can take it further with small helper:
import org.apache.spark.sql.Column
def normalizeName(c: Column) = {
val pattern = "\\W+".r
c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
Upvotes: 22
Reputation: 694
Let's suppose human_df
is the DataFrame for humans. Since Spark 1.3:
human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
Upvotes: 37
Reputation: 13528
Anonymous columns, such as the one that would be generated by avg(age)
without AS avg_age
, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.
In Spark 1.4.1 with HiveContext
, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1
.
Upvotes: 1
Reputation: 2652
Turns out def toDF(colNames: String*): DataFrame
does exactly that. Pasting from 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF() // this implicit conversion creates a DataFrame
// with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
// column name "id" and "name"
Upvotes: 10