Reputation: 389
I want to list all the parquet files in adls folder.
dbutils.fs.ls("abfss://path/to/raw/files/*.parquet")
Is there a way to make the above statement work?
Upvotes: 9
Views: 18620
Reputation: 71
You cannot use wildcards directly with the dbutils.fs.ls command, but you can get all the files in a directory and then use a simple list comprehension to filter down to the files of interest. For example, to list all files ending with a given extension:
lst = [c.path for c in dbutils.fs.ls("abfss://path/to/raw/files/") if c.path.split("/")[-1].endswith(".parquet")]
Or using regular expressions:
import re
lst = [c.path for c in dbutils.fs.ls("abfss://path/to/raw/files/") if re.search(r"\.parquet$", c.path) is not None]
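If you prefer shell-style wildcards over regular expressions, the same filtering can be done with Python's fnmatch module on the paths returned by dbutils.fs.ls. A minimal sketch; the sample paths below are illustrative stand-ins for the listing output:

```python
from fnmatch import fnmatch

def filter_paths(paths, pattern):
    """Keep only paths whose final component matches the glob pattern."""
    return [p for p in paths if fnmatch(p.split("/")[-1], pattern)]

# Stand-in for [c.path for c in dbutils.fs.ls(...)]
paths = [
    "abfss://container/raw/files/a.parquet",
    "abfss://container/raw/files/b.csv",
    "abfss://container/raw/files/c.parquet",
]
print(filter_paths(paths, "*.parquet"))
```

This mirrors the wildcard syntax you wanted to pass to dbutils.fs.ls in the first place, e.g. "*.parquet" or "part-0000?-*.parquet".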
Upvotes: 7
Reputation: 1624
I ended up using the following code to filter the paths by a glob pattern:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path startsWith "/") new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark,
    areRootPaths = true)
  // If you are using Databricks Runtime 6.x or below,
  // remove the areRootPaths = true argument from the bulkListLeafFiles call.

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = "abfss://path/to/raw/files/"
val globp = "*.parquet" // glob pattern, e.g. "service=webapp/date=2019-03-31/*log4j*"

val files = listFiles(root, globp)
display(files.toDF("path"))
Unfortunately, the other proposed answers did not work for me in Databricks, hence this approach. For a detailed explanation, please refer to the source here.
Upvotes: 1
Reputation: 454
You can use magic commands to run shell commands, which do support wildcard syntax.
For example, you can use this in a Databricks cell:
%sh
ls /dbfs/mnt/mountpoint/path/to/raw/*.parquet
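Alternatively, Python's glob module works against the same /dbfs local paths, so the wildcard listing can stay in a Python cell. A minimal sketch, using a temporary directory as a stand-in for /dbfs/mnt/mountpoint/path/to/raw so it runs anywhere:

```python
import glob
import os
import tempfile

# Throwaway directory standing in for /dbfs/mnt/mountpoint/path/to/raw
root = tempfile.mkdtemp()
for name in ("a.parquet", "b.parquet", "notes.txt"):
    open(os.path.join(root, name), "w").close()

# glob honors shell-style wildcards, unlike dbutils.fs.ls
matches = sorted(glob.glob(os.path.join(root, "*.parquet")))
print([os.path.basename(m) for m in matches])  # ['a.parquet', 'b.parquet']
```

On a cluster you would replace root with the mounted path, e.g. "/dbfs/mnt/mountpoint/path/to/raw"; note this only works for storage that is accessible through the /dbfs local file API.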
Upvotes: 7
Reputation: 4544
Use it like this:
path="abfss://path/to/raw/files/*.parquet"
filelist=dbutils.fs.ls(path)
print(filelist)
The above code will print the names of all parquet files in the given path.
Upvotes: 0