Reputation: 597
Right now the Databricks Autoloader requires a directory path from which all the files will be loaded. But if some other kind of log files also starts arriving in that directory - is there a way to ask Autoloader to exclude those files while preparing the dataframe?
df = spark.readStream.format("cloudFiles") \
.option(<cloudFiles-option>, <option-value>) \
.schema(<schema>) \
.load(<input-path>)
Upvotes: 2
Views: 3077
Reputation: 556
Use pathGlobFilter as one of the options and provide a glob pattern to filter for a file type or for files with a specific name. Note that the pattern is inclusive: only files matching it are loaded. For instance, to load only files named A1.csv, A2.csv, ..., A9.csv from the load location, the value for pathGlobFilter will look like:
df = spark.read.load("/file/load/location,
format="csv",
schema=schema,
pathGlobFilter="A[0-9].csv")
Upvotes: -1
Reputation: 87259
Autoloader supports specification of a glob string as <input-path> - from the documentation:
<input-path> can contain file glob patterns
Glob syntax supports different options, like * for any sequence of characters, etc. So you can specify input-path as path/*.json, for example. You can exclude files as well - building that pattern could be slightly more complicated than an inclusion pattern, but it's still possible. For example, *.[^l][^o][^g] should exclude files with the .log extension.
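A minimal sketch of that exclusion, assuming JSON input (the path, format, and schema are placeholders; note that [^l][^o][^g] requires a three-character extension and also rejects names like data.lag, so test the glob against your actual filenames):
# the glob goes directly into the input path; files ending in .log are excluded
df = spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.schema(schema) \
.load("/input/path/*.[^l][^o][^g]")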
Upvotes: 5