Reputation: 136
I stored a file on Drive at "/content/drive/My Drive/BD-CW2", filename pickleRdd, in the same place as the job read_rdd.py,
but when I run the job on the cluster I get:
Traceback (most recent call last):
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 55, in <module>
    read_RDD(sys.argv[1:])
  File "/tmp/18dcd2bf5c104f01b6d25ea6919b7cfc/read_rdd.py", line 32, in read_RDD
Code to read the file inside the job:
import pickle

# relative path -> resolved against the job's working directory
RDDFromPickle = open('pickleRdd', 'rb')
RDDFromPickle = pickle.load(RDDFromPickle)
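For reference, the same open/load pattern as a self-contained sketch, using stand-in data and a temporary path (both hypothetical, just to show the round trip):

```python
import os
import pickle
import tempfile

# Stand-in for whatever was pickled by the job that wrote pickleRdd.
data = {"a": 1, "b": 2}

# Write the pickle to a temp directory, then read it back with the
# same open(..., 'rb') / pickle.load pattern used in the question.
path = os.path.join(tempfile.mkdtemp(), "pickleRdd")
with open(path, "wb") as f:
    pickle.dump(data, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)  # -> {'a': 1, 'b': 2}
```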
How can I redirect the code above to read from Drive (/content/drive/My Drive/BD-CW2)? Or move the file from Drive to the cluster so the job can access it? Everything works fine when I run on Colab; I just cannot access the file when I run on the cluster.
The easiest way seems to be to adjust it to:
RDDFromPickle = open('/content/drive/My Drive/BD-CW2/pickleRdd', 'rb')
But how can I pass the Google Drive location?
Upvotes: 0
Views: 778
Reputation: 416
Since you are using Google Cloud Platform, I guess you are deploying your PySpark file to Cloud Dataproc. If so, I suggest uploading your file to a bucket in Google Cloud Storage and reading it from there with code like the following (assuming it's a CSV file):
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('dataproc-python-demo') \
    .getOrCreate()

df = spark.read.format("csv") \
    .option("header", "false") \
    .load("gs://<bucket>/file.csv")
count_value = df.rdd.map(lambda line: (line._c0, line._c1)).count()
print(count_value)
The code above creates a DataFrame, which I turned into an RDD to format the values, but you can also do the same with the DataFrame type directly.
Note that _c0 and _c1 are the default column names assigned when a CSV file has no header. Once you have code like this, you can submit it to your Dataproc cluster this way:
gcloud dataproc jobs submit pyspark --cluster <cluster_name> \
    --region <region, example us-central1> gs://<bucket>/yourpyfile.py
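If the pickle itself needs to travel with the job, one option (a sketch with placeholder bucket and cluster names) is to copy it from the Colab-mounted Drive folder to GCS and stage it with the `--files` flag, which places the file in the job's working directory so the relative `open('pickleRdd', 'rb')` in the question keeps working unchanged:

```shell
# Run in Colab while Drive is mounted: copy the pickle to a GCS bucket.
gsutil cp "/content/drive/My Drive/BD-CW2/pickleRdd" gs://<bucket>/pickleRdd

# Submit the job, staging the pickle into the job's working directory.
gcloud dataproc jobs submit pyspark \
    --cluster <cluster_name> \
    --region <region, example us-central1> \
    --files gs://<bucket>/pickleRdd \
    gs://<bucket>/read_rdd.py
```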
To submit a new job in Dataproc, you can refer to this link [1].
[1] https://cloud.google.com/dataproc/docs/guides/submit-job#submitting_a_job
Upvotes: 1
Reputation: 33
Use the os module with abspath as follows:
import os.path
import pickle

# Note: 'rb' must be passed to open(), not to os.path.abspath()
RDDFromPickle = open(os.path.abspath('/content/drive/My Drive/BD-CW2/pickleRdd'), 'rb')
RDDFromPickle = pickle.load(RDDFromPickle)
Upvotes: 0