Reputation: 7728
I'm working in Azure Synapse Notebooks and reading file(s) into a DataFrame from a well-formed folder path like so:
Given there are many folders referenced by that wildcard, how do I capture the "State" value as a column in the resulting DataFrame?
Upvotes: 1
Views: 8167
Reputation: 15258
No need to use the wildcard *.
Try: df = spark.read.load("abfss://....dfs.core.windows.net/")
Spark can read partitioned folders directly, and df should then contain the column State with its different values.
Upvotes: 0
Reputation: 2936
Use the input_file_name function to get the full input path, then apply regexp_extract to extract the part that you want.
Example:
from pyspark.sql import functions as F
df = df.withColumn("filepath", F.input_file_name())
df = df.withColumn("State", F.regexp_extract("filepath", r"State=([^/]+)", 1))
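Before running an extraction pattern in Spark, it can help to sanity-check it against a sample path with Python's re module. This is only a sketch: the path below is hypothetical (the question does not show the real container, account, or folder names), and Spark's regexp_extract uses Java regular expressions, although patterns this simple should behave the same way.

```python
import re

# Hypothetical file path of the kind input_file_name() returns; the real
# container, account, and folder names are not given in the question.
sample_path = (
    "abfss://container@account.dfs.core.windows.net/"
    "data/State=CA/part-00000.snappy.parquet"
)

# Stopping at the next "/" captures only the folder's value.
folder_only = re.search(r"State=([^/]+)", sample_path).group(1)

# A greedy ".+" anchored on the file suffix also swallows the file name.
greedy = re.search(r"State=(.+)\.snappy\.parquet", sample_path).group(1)

print(folder_only)  # CA
print(greedy)       # CA/part-00000
```

If the State value sits in a folder name rather than in the file name itself, the pattern that stops at the next slash is likely the one you want.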
Upvotes: 3