Woody Pride

Reputation: 13955

Recursively Read Files with Spark wholeTextFiles

I have a directory in an azure data lake that has the following path:

'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib'

Within this directory there are around 50 other directories whose names have the format 20190404.

The directory 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/20180404' contains 100 or so XML files that I am working with.

I can create an RDD for each of the sub-folders, which works fine, but ideally I want to pass only the top path and have Spark recursively find the files. I have read other SO posts and tried using a wildcard:

pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*'
rdd = sc.wholeTextFiles(pathWild)
rdd.count()

But it just freezes and does nothing at all, and seems to kill the kernel entirely. I am working in Jupyter on Spark 2.x. I am new to Spark. Thanks!

Upvotes: 2

Views: 1931

Answers (1)

grepIt

Reputation: 116

Try this:

pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
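For context, a minimal sketch of the full call, reusing the same SparkContext sc from the question. The reason this works: a single * only matches the dated sub-directories themselves, whereas */* goes one level deeper and matches the files inside them:

pathWild = 'adl://home/../psgdata/clusters/iptiqadata-prod-cluster-eus2-01/psgdata/mib/*/*'
# wholeTextFiles yields one (path, content) pair per matched file
rdd = sc.wholeTextFiles(pathWild)
rdd.count()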

Upvotes: 2
