touelv

Reputation: 73

Select field only if it exists (SQL or Scala)

The input dataframe may not always have all the columns. In SQL or Scala, I want to create a select statement that won't error out when the dataframe does not have one of the columns, and will simply output the columns that do exist.

For example, this statement will work.

Select store, prod, distance from table

+-----+------+--------+
|store|prod  |distance|
+-----+------+--------+
|51   |42    |2       |
|51   |42    |5       |
|89   |44    |9       |
+-----+------+--------+

If the dataframe looks like the one below, I want the same statement to work: just ignore what's not there and output only the existing columns (in this case 'store' and 'prod').

+-----+------+
|store|prod  |
+-----+------+
|51   |42    |
|51   |42    |
|89   |44    |
+-----+------+
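
For reference, the two example dataframes can be reproduced in a spark-shell like this (a minimal sketch; the names dfFull and dfPartial are assumptions, not from the question):

val dfFull = Seq((51, 42, 2), (51, 42, 5), (89, 44, 9)).toDF("store", "prod", "distance")

val dfPartial = Seq((51, 42), (51, 42), (89, 44)).toDF("store", "prod")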

Upvotes: 2

Views: 4670

Answers (4)

wreckingguac

Reputation: 1

Old question, but the top answer doesn't work, and the second one only works if you are looking for a single column.

I have an array of column names and I want to select them, but they won't always exist.

import org.apache.spark.sql.functions.col

val columnNames = Seq("column1", "column2", "column3"...)

This will grab all of the ones that you want if they exist.

val columnsToSelect = columnNames.filter(df.columns.contains)

val selectedDf = df.select(columnsToSelect.map(col): _*)
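
As a quick end-to-end sketch in a spark-shell session, using the column names and data from the question (selectedDf is just the name used for the result above):

import org.apache.spark.sql.functions.col

val df = Seq((51, 42), (51, 42), (89, 44)).toDF("store", "prod")
val columnNames = Seq("store", "prod", "distance")

// keep only the names that actually exist in df
val columnsToSelect = columnNames.filter(df.columns.contains)

// selects just store and prod; the missing distance column is ignored
val selectedDf = df.select(columnsToSelect.map(col): _*)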

Upvotes: 0

Ranga Vure

Reputation: 1932

You can keep the list of all columns in a Seq, either hard coded or prepared from other metadata, and use intersect:

import org.apache.spark.sql.functions.col

val columnNames = Seq("c1","c2","c3","c4")

df.select(df.columns.intersect(columnNames).map(x => col(x)): _*).show()
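
The list does not have to be hard coded; as mentioned above, it can also be prepared from other metadata. A small sketch, assuming a hypothetical reference DataFrame refDf whose schema carries the full set of expected columns:

// take the expected column names from the reference DataFrame's schema
val columnNames = refDf.columns.toSeq

// intersect keeps only the names present in df, in df's own column order
df.select(df.columns.intersect(columnNames).map(x => col(x)): _*).show()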

Upvotes: 2

Andronicus

Reputation: 26046

You can make use of the columns method on DataFrame. This would look like this:

val result = if(df.columns.contains("distance")) df.select("store", "prod", "distance") 
             else df.select("store", "prod")

Edit:

If you have many such columns, you can keep them in an array, for example cols, and filter it:

val selectedCols = cols.filter(c => df.columns.contains(c)).map(col)
val result = df.select(selectedCols: _*)

Upvotes: 2

stack0114106

Reputation: 8711

Assuming you use an expanded SQL template like select a,b,c from tab, you could do something like the below to get the required results.

  1. Get the SQL string and convert it to lowercase.
  2. Split the SQL on spaces or commas to get the individual words in an array.
  3. Remove "select" and "from" from the above array, as they are SQL keywords.
  4. Now the last index is the table name.
  5. The indexes from the first up to the last-but-one contain the list of select columns.
  6. To get the required columns, filter them against df2.columns. The columns that are in the SQL but not in the table will be filtered out.
  7. Now construct the SQL using the individual pieces.
  8. Run it using spark.sql(reqd_sel_string) to get the results.

Check this out

scala> val df2 = Seq((51,42),(51,42),(89,44)).toDF("store","prod")
df2: org.apache.spark.sql.DataFrame = [store: int, prod: int]

scala> df2.createOrReplaceTempView("tab2")

scala> val sel_query="Select store, prod, distance from tab2".toLowerCase
sel_query: String = select store, prod, distance from tab2

scala> val tabl_parse = sel_query.split("[ ,]+").filter(_!="select").filter(_!="from")
tabl_parse: Array[String] = Array(store, prod, distance, tab2)

scala> val tab_name=tabl_parse(tabl_parse.size-1)
tab_name: String = tab2

scala> val tab_cols = (0 until tabl_parse.size-1).map(tabl_parse(_))
tab_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod, distance)

scala> val reqd_cols = tab_cols.filter( x=>df2.columns.contains(x))
reqd_cols: scala.collection.immutable.IndexedSeq[String] = Vector(store, prod)

scala> val reqd_sel_string = "select " + reqd_cols.mkString(",") + " from " + tab_name
reqd_sel_string: String = select store,prod from tab2

scala> spark.sql(reqd_sel_string).show(false)
+-----+----+
|store|prod|
+-----+----+
|51   |42  |
|51   |42  |
|89   |44  |
+-----+----+


scala>
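
For reuse, the same steps could be wrapped in a small helper. A sketch (selectExisting is an assumed name, not part of the answer):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Rewrites "select ... from tab" so that it only references columns present in df, then runs it.
// Assumes the simple expanded template described in the steps above,
// and that df is already registered as the temp view named in the SQL.
def selectExisting(spark: SparkSession, df: DataFrame, sql: String): DataFrame = {
  val parts = sql.toLowerCase.split("[ ,]+").filter(_ != "select").filter(_ != "from")
  val tabName = parts.last
  val reqdCols = parts.dropRight(1).filter(df.columns.contains)
  spark.sql("select " + reqdCols.mkString(",") + " from " + tabName)
}

// usage, reusing df2 and tab2 from the spark-shell run above:
// selectExisting(spark, df2, "Select store, prod, distance from tab2").show(false)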

Upvotes: 1
