Reputation: 451
I want to check whether several files exist in HDFS before loading them with SparkContext. I use pyspark. I tried
os.system("hadoop fs -test -e %s" % path)
but as I have a lot of paths to check, the job crashed.
I also tried sc.wholeTextFiles(parent_path) and then filtering by keys, but that crashed as well because parent_path contains a lot of sub-paths and files.
Could you help me?
Upvotes: 8
Views: 12374
Reputation: 6154
One possibility is to use hadoop fs -lsr your_path to get all the paths, and then check whether the paths you're interested in are in that set.
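A minimal sketch of that idea, assuming the hadoop client is on the PATH (the directory and file names below are purely illustrative, and each listing line is assumed to end with a path that contains no spaces):
import subprocess

# One external call lists everything under the parent directory recursively.
listing = subprocess.check_output(["hadoop", "fs", "-lsr", "/user/someone/data"])
# The path is the last whitespace-separated field of each non-empty line.
existing = set(line.split()[-1] for line in listing.decode("utf-8").splitlines() if line.strip())

paths_to_check = ["/user/someone/data/a.txt", "/user/someone/data/b.txt"]
missing = [p for p in paths_to_check if p not in existing]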
Regarding your crash, it's possible it was a result of all the calls to os.system, rather than being specific to the hadoop command. Sometimes calling an external process can result in issues with buffers that never get released, in particular I/O buffers (stdin/stdout).
One solution would be to make a single call to a bash script that loops over all the paths. You can create the script using a string template in your code, fill in the array of paths, write it out, then execute it.
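A hedged sketch of that template idea (the script name and the paths are hypothetical); it also defines the your_script variable used in the subprocess call further down:
# Generate one bash script that tests every path, so only a single external
# process is spawned instead of one os.system() call per path.
script_template = """#!/bin/bash
for p in {paths}; do
  if hadoop fs -test -e "$p"; then echo "EXISTS $p"; else echo "MISSING $p"; fi
done
"""
paths_to_check = ["/user/someone/data/a.txt", "/user/someone/data/b.txt"]
with open("check_paths.sh", "w") as f:
    f.write(script_template.format(paths=" ".join(paths_to_check)))
your_script = "bash check_paths.sh"  # the command handed to the subprocess call below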
It may also be a good idea to switch to Python's subprocess module, which gives you more granular control over handling subprocesses. Here's a rough equivalent of os.system:
from subprocess import Popen, PIPE

# Popen lets you choose where stdout goes; PIPE captures it in-process.
process = Popen(
    args=your_script,
    stdout=PIPE,
    shell=True
)
output, _ = process.communicate()
Note that you can switch stdout to something like a file handle if that helps you with debugging or making the process more robust. You can also switch the shell=True argument to False unless you're going to call an actual script or use shell-specific things like pipes or redirection.
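With shell=False, args is passed as a list of strings rather than one command line; for example, a single existence check could look like this (the path is illustrative):
# hadoop fs -test -e exits with status 0 when the path exists.
process = Popen(
    args=["hadoop", "fs", "-test", "-e", "/user/someone/data/a.txt"],
    stdout=PIPE,
    shell=False
)
process.communicate()
path_exists = (process.returncode == 0)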
Upvotes: 0
Reputation: 838
As Tristan Reid says:
...(Spark) It can read many formats, and it supports Hadoop glob expressions, which are terribly useful for reading from multiple paths in HDFS, but it doesn't have a builtin facility that I'm aware of for traversing directories or files, nor does it have utilities specific to interacting with Hadoop or HDFS.
Anyway, this is his answer to a related question: Pyspark: get list of files/directories on HDFS path
Once you have the list of files in a directory, it is easy to check whether a particular file exists.
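For reference, a minimal sketch of that listing approach via the JVM Hadoop FileSystem API that pyspark exposes through py4j (it relies on the internal sc._jvm / sc._jsc handles of a live SparkContext, and the directory and file names are illustrative):
# Reach the Hadoop FileSystem API through the SparkContext's JVM gateway.
hadoop = sc._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

# List the directory once and keep the file names in a set.
statuses = fs.listStatus(hadoop.fs.Path("/user/someone/data"))
names = set(status.getPath().getName() for status in statuses)
print("a.txt" in names)

# Or ask for a single path directly.
print(fs.exists(hadoop.fs.Path("/user/someone/data/a.txt")))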
I hope it can help somehow.
Upvotes: 1