user19930511

Reputation: 389

How to list files using a wildcard in Databricks

I want to list all the parquet files in an ADLS folder.

dbutils.fs.ls("abfss://path/to/raw/files/*.parquet") 

Is there a way to make the above statement work?

Upvotes: 9

Views: 18620

Answers (4)

Michael Peo

Reputation: 71

You cannot use wildcards directly with the dbutils.fs.ls command, but you can get all the files in a directory and then use a simple list comprehension to filter down to the files of interest. For example, to get a list of all the files that end with the extension of interest:

lst = [c[0] for c in dbutils.fs.ls("abfss://path/to/raw/files/") if c[0].split("/")[-1].endswith(".parquet")]

Or using regular expressions:

import re
lst = [c.path for c in dbutils.fs.ls("abfss://path/to/raw/files/") if re.search(r"\.parquet$", c.path) is not None]
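If you need full shell-style wildcard matching rather than a fixed suffix, Python's built-in fnmatch module combines naturally with the same listing approach. A minimal sketch, assuming it runs in a Databricks notebook where dbutils is predefined (ls_glob is just an illustrative helper name):

from fnmatch import fnmatch

def ls_glob(dir_path, pattern):
    # List files in dir_path whose names match a shell-style wildcard pattern
    return [f.path for f in dbutils.fs.ls(dir_path) if fnmatch(f.name, pattern)]

parquet_files = ls_glob("abfss://path/to/raw/files/", "*.parquet")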

Upvotes: 7

DataBach

Reputation: 1624

I ended up using the following code to filter the paths by a glob pattern:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{Path, FileSystem}
import org.apache.spark.deploy.SparkHadoopUtil
import org.apache.spark.sql.execution.datasources.InMemoryFileIndex
import java.net.URI

def listFiles(basep: String, globp: String): Seq[String] = {
  val conf = new Configuration(sc.hadoopConfiguration)
  val fs = FileSystem.get(new URI(basep), conf)

  def validated(path: String): Path = {
    if (path.startsWith("/")) new Path(path)
    else new Path("/" + path)
  }

  val fileCatalog = InMemoryFileIndex.bulkListLeafFiles(
    paths = SparkHadoopUtil.get.globPath(fs, Path.mergePaths(validated(basep), validated(globp))),
    hadoopConf = conf,
    filter = null,
    sparkSession = spark,
    areRootPaths = true)

  // If you are using Databricks Runtime 6.x or below,
  // remove the areRootPaths = true argument from the bulkListLeafFiles call.

  fileCatalog.flatMap(_._2.map(_.path))
}

val root = ""abfss://path/to/raw/files/"
val globp = "*.parquet" // glob pattern, e.g. "service=webapp/date=2019-03-31/*log4j*"

val files = listFiles(root, globp)
display(files.toDF("path"))

Unfortunately, the other proposed answers did not work for me in Databricks, hence this approach. For a detailed explanation, please refer to the source here.
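The same glob expansion can also be driven from Python through the Hadoop FileSystem API exposed over py4j, which avoids Spark's internal InMemoryFileIndex entirely. A minimal sketch, assuming a notebook where spark is predefined; the abfss path is a placeholder:

# Build a Hadoop Path for the glob pattern via the JVM gateway
pattern = spark._jvm.org.apache.hadoop.fs.Path("abfss://path/to/raw/files/*.parquet")
fs = pattern.getFileSystem(spark._jsc.hadoopConfiguration())

# globStatus expands the wildcard; it returns null (None) when nothing matches
statuses = fs.globStatus(pattern) or []
paths = [str(status.getPath()) for status in statuses]
print(paths)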

Upvotes: 1

partha_devArch

Reputation: 454

You can use magic commands to run shell commands, which do support wildcard syntax.

For example, you can use this in a Databricks cell:

%sh
ls /dbfs/mnt/mountpoint/path/to/raw/*.parquet
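Because the shell sees DBFS through the /dbfs FUSE mount, the same wildcard listing also works from Python with the standard glob module (a sketch, assuming the storage is mounted at the path shown):

import glob

# /dbfs is the FUSE mount of DBFS, so standard file APIs can reach mounted storage
parquet_files = glob.glob("/dbfs/mnt/mountpoint/path/to/raw/*.parquet")
print(parquet_files)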

Upvotes: 7

Utkarsh Pal

Reputation: 4544

Use it like this:

path="abfss://path/to/raw/files/*.parquet"
filelist=dbutils.fs.ls(path)
print(filelist)

The above code will print the names of all parquet files in the given path.

Upvotes: 0
