DespicableMe

Reputation: 61

Use a list to define SELECT columns in a query

I need to query a parquet file whose column names are completely inconsistent. To remedy this and ensure that my model gets exactly the data it expects, I need to 'prefetch' the column list, then apply some regex patterns to qualify which columns to retrieve. In pseudocode:

PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert statements here to check/qualify the columns against rules
dfQualified = SELECT [PrefetchList] FROM parquet;

I've searched around to see if this is achievable but haven't had any success. If this is syntactically correct (or close), or if someone has other suggestions, I'm open to them.

Thanks

Upvotes: 0

Views: 685

Answers (1)

Roberto Congiu

Reputation: 5213

You can use the schema method, but you can also use the .columns method. Notice that the String-based select method in Spark is a little odd: it's defined as def select(col: String, cols: String*), so you can't simply call select(fields: _*); you'd have to use df.select(fields.head, fields.tail: _*), which is kind of ugly. Luckily there's selectExpr(exprs: String*) as an alternative, so the snippet below will work. It keeps only the columns that begin with 'user':

val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp
df.selectExpr(fields: _*)
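
For comparison, a quick sketch of the plain select alternatives mentioned above; fields is the same Array[String] as in the snippet, and col comes from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.col

// String-varargs overload: needs the head/tail split described above
df.select(fields.head, fields.tail: _*)

// Or map the names to Columns and use the Column* overload instead
df.select(fields.map(col): _*)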

This of course assumes that df holds your DataFrame, loaded with sqlContext.read.parquet().
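
Tying that back to the question's 'prefetch' idea, here's a minimal end-to-end sketch (the file path and the regexp are placeholders; substitute your own qualification rules):

// Placeholder path: point this at your parquet file
val df = sqlContext.read.parquet("/path/to/my_parquet_file")

// "Prefetch" the column names from the schema...
val prefetch = df.schema.fields.map(_.name)   // equivalent to df.columns

// ...qualify them against your rules (placeholder pattern)...
val fields = prefetch.filter(_.matches("^user.+"))

// ...and select only the qualified columns
val dfQualified = df.selectExpr(fields: _*)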

Upvotes: 1
