Recursively Read Files Spark wholeTextFiles

Question

I have a directory in an Azure Data Lake with the following path:

'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'

Within this directory there are about 50 sub-directories, each named by date in the format 20190404.

The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so XML files, which are the ones I am working with.

I can create an RDD for each of the sub-folders, which works fine, but ideally I want to pass only the top-level path and have Spark find the files recursively. I have read other SO posts and tried using a wildcard like this:

pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()

But it just freezes and does nothing at all, and seems to kill the kernel completely. I am working in Jupyter on Spark 2.x and am new to Spark. Thanks!

grepIt · Accepted Answer

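Try adding a second wildcard level, so the pattern matches the files inside each dated sub-directory rather than the sub-directories themselves: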

pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
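If the nesting depth might change later, the pattern can be built rather than typed by hand. The following is only a rough sketch: the helper names nested_glob and read_xml_files, and the min_partitions argument, are made up for illustration and are not part of Spark. It assumes `sc` is the SparkContext already available in the notebook and relies only on sc.wholeTextFiles(path, minPartitions).

```python
def nested_glob(base_path, depth=2):
    """Append `depth` wildcard levels to base_path (depth=2 gives base/*/*).

    Hypothetical helper, not a Spark API.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return "/".join([base_path.rstrip("/")] + ["*"] * depth)


def read_xml_files(sc, base_path, depth=2, min_partitions=None):
    """Return an RDD of (file path, file content) pairs.

    Hypothetical wrapper around sc.wholeTextFiles; `sc` is assumed to be
    an existing SparkContext, e.g. the one a Jupyter kernel provides.
    """
    pattern = nested_glob(base_path, depth)
    return sc.wholeTextFiles(pattern, minPartitions=min_partitions)


# Usage in the notebook (left commented out because it needs a live cluster):
# base = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'
# rdd = read_xml_files(sc, base)
# print(rdd.keys().take(5))   # look at a few file paths before running a full count()
```

Note that wholeTextFiles loads each file's full content as a single record, so checking a handful of keys with take() first is a cheaper way to confirm the pattern matches than calling count() on everything.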
