BlueTurtle

Reputation: 383

Why can't my GCP script/notebook find my file?

I have a working script that finds the data file when it is in the same directory as the script. This works both on my local machine and Google Colab.

When I try it on GCP, though, it cannot find the file. I tried three approaches:

PySpark Notebook:

  1. Upload the .ipynb file, which includes a wget command. This downloads the file without error, but I am unsure where it saves it, and the script cannot find the file either (I assume because I am telling it that the file is in the same directory, and presumably wget on GCP saves it somewhere else by default).

PySpark with bucket:

  1. I did the same as with the PySpark notebook above, but first I uploaded the dataset to the bucket and then used the two links shown in the file details when you click the file name inside the bucket on the console (neither worked). I would like to avoid this approach anyway, as running wget on the cluster is much faster than downloading over my slow wifi and then re-uploading to the bucket through the console.

GCP SSH:

  1. Create cluster
  2. Access VM through SSH.
  3. Upload .py file using the cog icon
  4. wget the dataset and move both into the same folder
  5. Run the script using python gcp.py

This just gives me a file-not-found error.

Thanks.

Upvotes: 0

Views: 408

Answers (1)

Vishal K

Reputation: 1464

Regarding your first and third approaches: if you are running PySpark code on Dataproc, irrespective of whether you use an .ipynb file or a .py file, please note the following points:

If you use the wget command to download the file, it will be downloaded into the current working directory where your code is executed.

When you try to access the file through PySpark code, it will look in HDFS by default. If you want to access the downloaded file from the local filesystem, use the file:/// URI scheme with the absolute file path.
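A minimal sketch of what that looks like, assuming a hypothetical CSV named data.csv downloaded to /home/myuser (substitute your own path):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-local-file").getOrCreate()

    # The file:/// scheme tells Spark to read from the node's local
    # filesystem instead of HDFS; the path after the scheme must be absolute.
    df = spark.read.csv("file:///home/myuser/data.csv", header=True)
    df.show(5)

Keep in mind that with file:/// the file has to exist at that path on every node that reads it, so on a multi-node cluster moving it into HDFS (below) is the more robust option.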

If you want to access the file from HDFS, you first have to move the downloaded file into HDFS and then access it using an absolute HDFS file path. Please refer to the example below:

hadoop fs -put <local file_name> </HDFS/path/to/directory>
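Once the file is in HDFS, a plain path (or an explicit hdfs:// URI) resolves against HDFS, which is the default filesystem on a Dataproc cluster. A sketch with the same hypothetical file name, assuming it was put under /user/myuser:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-hdfs-file").getOrCreate()

    # No file:/// scheme here, so Spark resolves the path against
    # HDFS, the default filesystem on Dataproc.
    df = spark.read.csv("hdfs:///user/myuser/data.csv", header=True)
    df.show(5)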

Upvotes: 1
