Reputation: 2652

Is it possible to alias columns programmatically in spark sql?

In spark SQL (perhaps only HiveQL) one can do:

select sex, avg(age) as avg_age
from humans
group by sex

which would result in a DataFrame with columns named "sex" and "avg_age".

How can avg(age) be aliased to "avg_age" without using textual SQL?

Edit: After zero323 's answer, I need to add the constraint that:

The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).

Upvotes: 27

Answers (4)

zero323

Reputation: 330063

If you prefer to rename a single column it is possible to use withColumnRenamed method:

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
    Person("Alice", 2) :: Person("Bob", 5) :: Nil) 
df.withColumnRenamed("name", "first_name")

Alternatively you can use alias method:

import org.apache.spark.sql.functions.avg

df.select(avg($"age").alias("average_age"))

You can take it further with small helper:

import org.apache.spark.sql.Column

def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}

df.select(normalizeName(avg($"age")))

Upvotes: 22

Robert Chevallier

Reputation: 694

Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:

human_df.groupBy("sex").agg(avg("age").alias("avg_age"))

Upvotes: 37

Sim

Reputation: 13528

Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.

In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.

Upvotes: 1

Prikso NAI

Reputation: 2652

Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from 2.11.7 documentation:

def toDF(colNames: String*): DataFrame

Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame
                // with column name _1 and _2
    rdd.toDF("id", "name")  // this creates a DataFrame with
                            // column name "id" and "name"

Upvotes: 10

Is it possible to alias columns programmatically in spark sql?

Answers (4)

Related Questions