Can we exclude or include only particular file extensions from Databricks Autoloader?

Question

Right now the Databricks Autoloader requires a directory path from which all files are loaded. But if some other kind of files, such as log files, also start arriving in that directory, is there a way to ask Autoloader to exclude them while preparing the dataframe?

df = spark.readStream.format("cloudFiles") \
  .option(<cloudFiles-option>, <option-value>) \
  .schema(<schema>) \
  .load(<input-path>)

Alex Ott · Accepted Answer

Autoloader supports specifying a glob pattern as the <input-path>. From the documentation:

<input-path> can contain file glob patterns
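As a minimal sketch of what that looks like in practice, the glob can be appended to the directory passed to load(). The helper names (json_glob, read_json_only), the example directory, and the DDL schema string are hypothetical and only for illustration; cloudFiles.format is the Auto Loader option that names the file format, and spark is assumed to be an existing SparkSession on Databricks.

```python
def json_glob(input_dir: str) -> str:
    """Append a *.json glob to a directory path (hypothetical helper)."""
    return input_dir.rstrip("/") + "/*.json"


def read_json_only(spark, input_dir: str, schema: str):
    """Build an Auto Loader stream that only picks up *.json files.

    `spark` is assumed to be an existing SparkSession on Databricks;
    `schema` is a DDL string such as "id INT, message STRING".
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")  # file format to ingest
        .schema(schema)
        .load(json_glob(input_dir))           # glob restricts the input files
    )
```

For example, read_json_only(spark, "/mnt/landing/events", "id INT, message STRING") would load from /mnt/landing/events/*.json, where the path is made up for illustration.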

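Glob syntax supports several constructs, such as * to match any sequence of characters. So you can specify input-path as path/*.json, for example. You can exclude files as well, although building an exclusion pattern is slightly more complicated than building an inclusion pattern. For example, *.[^l][^o][^g] should exclude files with the .log extension. Note that this pattern only matches three-character extensions, and it also skips any extension that shares a character with "log" in the same position (.lzo or .png, for instance).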
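Before pointing a stream at an exclusion pattern, it can help to check locally which file names it would accept. The sketch below hand-translates *.[^l][^o][^g] into a Python regular expression; this is an approximation for bare file names, not the matcher Autoloader itself uses, and the file names are hypothetical.

```python
import re

# Hand-written regex equivalent of the glob *.[^l][^o][^g]
# (an approximation for bare file names, not Autoloader's own matcher).
NOT_LOG = re.compile(r".*\.[^l][^o][^g]")


def matches(name: str) -> bool:
    """Return True if the whole file name matches the exclusion glob."""
    return NOT_LOG.fullmatch(name) is not None


# Hypothetical file names
for name in ["a.csv", "a.log", "a.json", "a.png"]:
    print(name, matches(name))
# a.csv True
# a.log False
# a.json False  (extension is not three characters long)
# a.png False   (third character is "g")
```

If only a known set of extensions is wanted, an inclusion pattern such as path/*.json is usually easier to reason about than a character-by-character exclusion.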
