Tomasz Kaczmarski

Reputation: 136

How to load a file from Google Cloud into a job

I stored the file on Google Drive at "/content/drive/My Drive/BD-CW2"; the filename is pickleRdd, and it sits in the same location as the job script read_rdd.py.

But when I run the job on the cluster I get:

Traceback (most recent call last):
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 55, in <module>
    read_RDD(sys.argv[1:])
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 32, in read_RDD

The code that reads the file inside the job:

RDDFromPickle = open('pickleRdd', 'rb')
RDDFromPickle = pickle.load(RDDFromPickle)

How can I redirect the code above to read from Drive (/content/drive/My Drive/BD-CW2)? Or move the file from Drive to the cluster so the job can access it? Everything works fine when I run it on Colab; I just cannot access the file when I run it on the cluster.

The easiest way seems to be to adjust it to

RDDFromPickle = open('/content/drive/My Drive/BD-CW2/pickleRdd', 'rb')

but how can I pass the Google Drive location?

Upvotes: 0

Views: 778

Answers (2)

Since you are using Google Cloud Platform, I guess you are deploying your PySpark file to Cloud Dataproc. If so, I suggest uploading your file to a bucket in Google Cloud Storage and reading it from there with code like the following (assuming it is a CSV file):

from pyspark.sql import SparkSession

spark = SparkSession \
   .builder \
   .appName('dataproc-python-demo') \
   .getOrCreate()

df = spark.read.format("csv") \
    .option("header", "false") \
    .load("gs://<bucket>/file.csv")

count_value = df.rdd.map(lambda line: (line._c0, line._c1)).count()

print(count_value)

The code above creates a DataFrame, which I turned into an RDD to format the values, but you can also do the same thing with the DataFrame API directly.
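For example, a minimal sketch of the same count using only the DataFrame API (the bucket and file name are placeholders, as above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dataproc-python-demo').getOrCreate()

# Same headerless CSV as above; Spark assigns default column names _c0, _c1, ...
df = spark.read.format("csv").option("header", "false").load("gs://<bucket>/file.csv")

# Count the rows directly with the DataFrame API, no conversion to RDD needed
count_value = df.select("_c0", "_c1").count()
print(count_value)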

Note that _c0 and _c1 are the default column names Spark assigns when the CSV file has no header. Once you have code like this, you can submit it to your Dataproc cluster this way:

gcloud dataproc jobs submit pyspark --cluster <cluster_name> \
    --region <region, for example us-central1> gs://<bucket>/yourpyfile.py

In order to submit a new job to Dataproc, you can refer to this link [1].

[1] https://cloud.google.com/dataproc/docs/guides/submit-job#submitting_a_job
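Since the question is actually about a Python pickle rather than a CSV, here is a minimal sketch of downloading the pickled file from the bucket inside the job and loading it. The bucket name and object path are placeholders, and it assumes the google-cloud-storage client library is available on the cluster:

import pickle
from google.cloud import storage

# Download the pickled object from the bucket to local disk
client = storage.Client()
bucket = client.bucket("<bucket>")        # placeholder bucket name
blob = bucket.blob("BD-CW2/pickleRdd")    # placeholder object path
blob.download_to_filename("/tmp/pickleRdd")

# Load it exactly as in the original code, just from the downloaded copy
with open("/tmp/pickleRdd", "rb") as f:
    RDDFromPickle = pickle.load(f)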

Upvotes: 1

t-dsai

Reputation: 33

Use the os module with abspath as follows:

import os.path
import pickle

# Pass only the path to abspath; the file mode belongs to open()
RDDFromPickle = open(os.path.abspath('/content/drive/My Drive/BD-CW2/pickleRdd'), 'rb')
RDDFromPickle = pickle.load(RDDFromPickle)
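Note that this path only exists when Google Drive is mounted in the runtime. A minimal sketch of mounting it (this works in a Colab notebook, not on a Dataproc cluster):

from google.colab import drive

# Mount Google Drive so that /content/drive/My Drive/... becomes readable
drive.mount('/content/drive')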

Upvotes: 0
