Reputation: 73
The input dataframe may not always have all of the columns. In SQL or Scala, I want to create a select statement that won't error out if the dataframe is missing a column, and that only outputs the columns that do exist.
For example, this statement will work.
Select store, prod, distance from table
+-----+----+--------+
|store|prod|distance|
+-----+----+--------+
|51   |42  |2       |
|51   |42  |5       |
|89   |44  |9       |
+-----+----+--------+
If the dataframe looks like the one below, I want the same statement to still work: just ignore what's not there and output the existing columns (in this case 'store' and 'prod').
+-----+----+
|store|prod|
+-----+----+
|51   |42  |
|51   |42  |
|89   |44  |
+-----+----+
Upvotes: 2
Views: 4670
Reputation: 1
Old question, but the top answer doesn't work, and the second one only works if you are looking for a single column.
I have an array of column names that I want to select, but they won't always exist.
import org.apache.spark.sql.functions.col
val columnNames = Seq("column1", "column2", "column3"...)
This will grab all of the ones that you want, if they exist:
val columnsToSelect = columnNames.filter(df.columns.contains)
val dfSelected = df.select(columnsToSelect.map(col): _*)
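For reference, a minimal end-to-end sketch of that approach, assuming the sample data from the question and that spark.implicits._ is in scope (as it is in spark-shell); the expected output is shown in comments:
import org.apache.spark.sql.functions.col

// DataFrame that only has "store" and "prod", as in the question
val df = Seq((51, 42), (51, 42), (89, 44)).toDF("store", "prod")

// "distance" is requested but does not exist in df
val columnNames = Seq("store", "prod", "distance")

// Keep only the requested columns that actually exist, then select them
val columnsToSelect = columnNames.filter(df.columns.contains)
val result = df.select(columnsToSelect.map(col): _*)

result.show(false)
// +-----+----+
// |store|prod|
// +-----+----+
// |51   |42  |
// |51   |42  |
// |89   |44  |
// +-----+----+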
Upvotes: 0
Reputation: 1932
You can keep a list of all the columns you want, either hard-coded or prepared from other metadata, and use intersect:
val columnNames = Seq("c1","c2","c3","c4")
df.select( df.columns.intersect(columnNames).map(x=>col(x)): _* ).show()
Upvotes: 2
Reputation: 26046
You can make use of the columns method on DataFrame. It would look like this:
val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance")
else df.select("store", "prod")
Edit: If you have many such columns, you can keep them in an array, for example cols, and filter it:
val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols:_*)
Upvotes: 2
Reputation: 8711
Assuming you use an expanded SQL template like "select a,b,c from tab", you could do something like the following to get the required results.
Check this out
scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]
scala> df2.createOrReplaceTempView("tab2")
scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2
scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)
scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2
scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)
scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)
scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2
scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51 |42 |
|51 |42 |
|89 |44 |
+-----+----+
scala>
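For readers who want this outside the REPL, here is a consolidated sketch of the same idea as a small helper. The name selectExisting and its signature are hypothetical (not from the answer above); the parsing logic is the same and it assumes the table is already registered as a temp view:
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: rewrites a simple "select a,b,c from tab" query so that it
// only references the columns that actually exist in the given DataFrame.
def selectExisting(spark: SparkSession, df: DataFrame, query: String): DataFrame = {
  val tokens   = query.toLowerCase.split("[ ,]+").filter(_ != "select").filter(_ != "from")
  val tabName  = tokens.last             // last token is the table name
  val reqCols  = tokens.dropRight(1)     // everything before it is a requested column
  val existing = reqCols.filter(df.columns.contains)
  spark.sql("select " + existing.mkString(",") + " from " + tabName)
}

// Usage, with the same data as above:
// df2.createOrReplaceTempView("tab2")
// selectExisting(spark, df2, "Select store, prod, distance from tab2").show(false)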
Upvotes: 1