A7med

Reputation: 451

pyspark: how to check if a file exists in HDFS

I want to check whether several files exist in HDFS before loading them with SparkContext. I am using pyspark. I tried os.system("hadoop fs -test -e %s" % path), but since I have a lot of paths to check, the job crashed. I also tried sc.wholeTextFiles(parent_path) and then filtering by key, but that crashed as well because the parent_path contains a lot of sub-paths and files. Could you help me?

Upvotes: 8

Views: 12374

Answers (3)

Tristan Reid

Reputation: 6154

One possibility is that you can use hadoop fs -lsr your_path to get all the paths, and then check if the paths you're interested in are in that set.
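For example, a rough sketch of that approach (the parent directory and file names below are made up, and the listing format can vary slightly between Hadoop versions):

import subprocess

# Run a single recursive listing and keep the last field of each line, which is the full path.
listing = subprocess.check_output(["hadoop", "fs", "-lsr", "/user/some/parent_dir"])
existing = set(line.split()[-1] for line in listing.decode("utf-8").splitlines() if line.strip())

paths_to_check = ["/user/some/parent_dir/a.txt", "/user/some/parent_dir/b.txt"]
missing = [p for p in paths_to_check if p not in existing]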

Regarding your crash, it's possible it was caused by all of the calls to os.system, rather than by the hadoop command itself. Repeatedly spawning external processes can sometimes run into issues with buffers that never get released, in particular I/O buffers (stdin/stdout).

One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths, write it out, and then execute it; a rough sketch of that idea is below.
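A possible sketch of generating such a script (the paths and the file name are placeholders):

paths = ["/data/a.parquet", "/data/b.parquet"]  # the paths you want to check

script_lines = ["#!/bin/bash"]
for p in paths:
    # prints "<path> 0" if the path exists, "<path> 1" otherwise
    script_lines.append('hadoop fs -test -e "{0}"; echo "{0} $?"'.format(p))

with open("check_paths.sh", "w") as f:
    f.write("\n".join(script_lines) + "\n")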

It may also be a good idea to switch to the subprocess module of Python, which gives you more granular control over handling subprocesses. Here's a rough equivalent of the os.system call:

from subprocess import Popen, PIPE

process = Popen(
    args=your_script,   # the script built above (a shell command string)
    stdout=PIPE,        # capture its output; swap for a file handle if useful
    shell=True
)
output, _ = process.communicate()

Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. You can also set shell=False and pass an argument list instead, unless you're passing a shell command string or relying on shell-specific features like pipes or redirection.
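For example, a small sketch applying both suggestions (file names are placeholders): run the generated script with an argument list instead of a shell string and send its output to a log file.

import subprocess

# check_paths.sh is the hypothetical script written above
with open("check_paths.log", "w") as log:
    subprocess.check_call(["bash", "check_paths.sh"], stdout=log)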

Upvotes: 0

David

Reputation: 11573

Have you tried using pydoop? The exists function should work.
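If I remember the API correctly, pydoop.hdfs.path mirrors os.path, so the check would look roughly like this (the path is made up, and sc is assumed to be an existing SparkContext):

import pydoop.hdfs as hdfs

path = "/user/someone/some_file.txt"  # hypothetical path
if hdfs.path.exists(path):
    rdd = sc.textFile(path)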

Upvotes: 0

Josemy

Reputation: 838

Right, as Tristan Reid says:

...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.

Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path

Once you have the list of files in a directory, it is easy to check whether a particular file exists.
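For example, a minimal sketch of that kind of check through the Hadoop FileSystem API exposed over the JVM gateway (assuming an active SparkContext sc; the path is made up):

# fs.exists() returns True if the path exists on HDFS
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
print(fs.exists(hadoop.fs.Path("/user/someone/some_file.txt")))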

I hope it can help somehow.

Upvotes: 1
