Ayan Biswas

Reputation: 1635

Select Columns in Spark Dataframe based on Column name pattern

I have a Spark DataFrame with the following column structure:

UT_LVL_17_CD,UT_LVL_20_CD, 2017 1Q,2017 2Q,2017 3Q,2017 4Q, 2017 FY,2018 1Q, 2018 2Q,2018 3Q,2018 4Q,2018 FY

In the above column structure, I will keep getting new columns for subsequent quarters, like 2019 1Q, 2019 2Q, etc.

I want to select UT_LVL_17_CD, UT_LVL_20_CD and the columns that match the pattern year<space>quarter, like 2017 1Q. Basically, I want to avoid selecting columns like 2017 FY and 2018 FY, and the selection has to be dynamic because I will get new FY data each year.

I am using Spark 2.4.4.
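
For reference, a DataFrame with this layout could be built like this (the values below are just placeholders for illustration, assuming an existing SparkSession named spark):

    import spark.implicits._

    // placeholder row; only the column names matter here
    val df = Seq(("A", "B", 1, 2, 3, 4, 10, 5, 6, 7, 8, 26))
      .toDF("UT_LVL_17_CD", "UT_LVL_20_CD",
        "2017 1Q", "2017 2Q", "2017 3Q", "2017 4Q", "2017 FY",
        "2018 1Q", "2018 2Q", "2018 3Q", "2018 4Q", "2018 FY")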

Upvotes: 0

Views: 5177

Answers (3)

Ruthika jawar

Reputation: 11

You can try this snippet, assuming DF is your DataFrame containing all of those columns.

val DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)

This will remove the FY-related columns. Hope this works for you.
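
If other non-quarter columns could show up later, a variation of the same idea (just a sketch, not tested against your data) keeps the two key columns plus anything matching the year<space>quarter pattern instead of excluding "FY":

    // keep the key columns plus anything named like "<year> <n>Q" (e.g. "2019 1Q")
    val quarterPattern = """^\d{4} \dQ$""".r
    val keep = DF.columns.filter(c =>
      c == "UT_LVL_17_CD" || c == "UT_LVL_20_CD" || quarterPattern.findFirstIn(c).isDefined)
    val DF2 = DF.select(keep.map(DF(_)) : _*)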

Upvotes: 0

eliasah

Reputation: 40370

As I stated in my comment, this can be done with plain Scala using a Regex, since the DataFrame can return its column names as an Array[String]:

scala> val columns = df.columns
// columns: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)

scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$

scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Array[String] = Array(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)

You can check that the selection does not contain the unwanted columns:

scala> columns.diff(selection)
// res2: Array[String] = Array(2017 FY, 2018 FY)

Now you can use the selection:

scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
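
Since new columns arrive every year, the same steps can be wrapped in a small helper that is re-applied to whatever columns the DataFrame has at that point (a sketch along the same lines; the function name is mine):

    import org.apache.spark.sql.DataFrame

    // re-runs the regex-based selection on the DataFrame's current columns
    def dropFyColumns(df: DataFrame): DataFrame = {
      val regex = """^((?!FY).)*$""".r
      val selection = df.columns.filter(c => regex.findFirstIn(c).isDefined)
      df.select(selection.head, selection.tail: _*)
    }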

Upvotes: 4

Tejas

Reputation: 401

You could use the desc SQL command to get the list of column names:

    import java.util

    // collect the names of all columns whose name contains "fy"
    val fyStringList = new util.ArrayList[String]()
    spark.sql("desc <table_name>").select("col_name")
      .filter(str => str.getString(0).toLowerCase.contains("fy"))
      .collect.foreach(str => fyStringList.add(str.getString(0)))
    println(fyStringList)

Use the above snippet to get the list of column names which contain "fy". You can update the filter logic with a regex, and also update the logic in the foreach for storing the string columns.
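
For example, the collected names could then be dropped from the original DataFrame (a sketch, assuming df is the DataFrame from the question):

    import scala.collection.JavaConverters._

    // drop every column whose name was collected into fyStringList
    val dfWithoutFy = df.drop(fyStringList.asScala.toSeq: _*)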

Upvotes: 0
