Reputation: 1635
I have a spark dataframe with the following column structure:
UT_LVL_17_CD,UT_LVL_20_CD, 2017 1Q,2017 2Q,2017 3Q,2017 4Q, 2017 FY,2018 1Q, 2018 2Q,2018 3Q,2018 4Q,2018 FY
In the above column structure , I will get new columns with subsequent quarters like 2019 1Q , 2019 2Q
etc
I want to select UT_LVL_17_CD,UT_LVL_20_CD
and columns which has the pattern year<space>quarter
like 2017 1Q
.
Basically I want to avoid selecting columns like 2017 FY , 2018 FY
, and this has to be dynamic as I will get new FY data each year
I am using spark 2.4.4
Upvotes: 0
Views: 5177
Reputation: 11
you can try this snippet. Assuming the DF is your dataframe which consists of all those columns.
var DF1 = DF.select(DF.columns.filter(x => !x.contains("FY")).map(DF(_)) : _*)
This will remove those FY related columns. Hope this works for you.
Upvotes: 0
Reputation: 40370
Like I stated in my comment, this can be done with plain scala using Regex
since the DataFrame
can return columns names as a Seq[String]
:
scala> val columns = df.columns
// columns: Seq[String] = List(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2017 FY, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q, 2018 FY)
scala> val regex = """^((?!FY).)*$""".r
// regex: scala.util.matching.Regex = ^((?!FY).)*$
scala> val selection = columns.filter(s => regex.findFirstIn(s).isDefined)
// selection: Seq[String] = List(UT_LVL_17_CD, UT_LVL_20_CD, 2017 1Q, 2017 2Q, 2017 3Q, 2017 4Q, 2018 1Q, 2018 2Q, 2018 3Q, 2018 4Q)
You can check that the selected columns does not contain the unwanted columns :
scala> columns.diff(selection)
// res2: Seq[String] = List(2017 FY, 2018 FY)
Now you can use the selection
:
scala> df.select(selection.head, selection.tail : _*)
// res3: org.apache.spark.sql.DataFrame = [UT_LVL_17_CD: int, UT_LVL_20_CD: int ... 8 more fields]
Upvotes: 4
Reputation: 401
You could use desc sql command to get list of column names
val fyStringList=new util.ArrayList[String]()
spark.sql("desc <table_name>").select("col_name").filter(str => str.getString(0).toLowerCase.contains("fy")).collect.foreach(str=>fyStringList.add(str.getString(0)))
println(fyStringList)
Use above snippet to get list of column name which contains "fy" You can update filter logic with regex and also update logic in forEach for storing string columns
Upvotes: 0