I need to query a parquet file where the column names are completely inconsistent. To remedy this and ensure that my model gets exactly the data it expects, I need to 'prefetch' the column list and then apply some regex patterns to qualify which columns I need to retrieve. In pseudocode:
PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert variable statements to check/qualify the columns against rules here
dfQualified = SELECT [PrefetchList] from parquet;
I've searched around to see if this is achievable but haven't had any success. If this is syntactically correct (or close), or if someone has other suggestions, I am open to them.
Thanks
You can use the schema method, but you can also use the .columns method.
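For example, assuming df already holds your DataFrame, either of these gives you the column names as plain strings to filter (the variable names are just for illustration):

val fromSchema  = df.schema.fields.map(_.name)  // via the schema's field metadata
val fromColumns = df.columns                    // shortcut: already an Array[String]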
Notice that the select method in Spark is a little odd: it's defined as def select(col: String, cols: String*), so you can't pass it select(fields: _*); you'd have to use df.select(fields.head, fields.tail: _*), which is kind of ugly. Luckily there's selectExpr(exprs: String*) as an alternative. So the snippet below works; it takes only columns that begin with 'user':
val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp
df.selectExpr(fields: _*)
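If you'd rather stick with plain select, the head/tail form mentioned above would look like this; note it assumes the regex matched at least one column, otherwise fields.head throws:

df.select(fields.head, fields.tail: _*)  // same result, fails on an empty fields array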
This of course assumes that df contains your DataFrame, loaded with sqlContext.read.parquet().
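Putting it together with the question's setup, a minimal sketch might look like this (the parquet path is a placeholder, and the regex is whatever qualifies your columns):

val df = sqlContext.read.parquet("/path/to/my_parquet_file")  // placeholder path
val fields = df.columns.filter(_.matches("^user.+"))          // your qualifying rules here
val dfQualified = df.selectExpr(fields: _*)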