Reputation: 972
The problem is quite simple: you have a local Spark instance (either a cluster or just local mode) and you want to read from gs:// paths.
Upvotes: 5
Views: 14439
Reputation: 746
Try the following configuration using PySpark. JARS_PATH is a string variable containing the absolute paths to the jar files, separated by commas. Make sure the required environment variables are set.
from pyspark.sql import SparkSession
JARS_PATH = '/LOCATION-TO-JARS/gcs-connector-hadoop3-latest.jar,/LOCATION-TO-JARS/spark-bigquery-latest_2.12.jar'
spark = SparkSession.builder.appName(SPARK_APP_NAME).config('spark.jars', JARS_PATH).getOrCreate()
spark._jsc.hadoopConfiguration().set('', '')
spark._jsc.hadoopConfiguration().set('', 'true')
spark._jsc.hadoopConfiguration().set('', 'MY-GCP-PROJECT-ID')
spark._jsc.hadoopConfiguration().set("", "")
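The property names in the snippet above did not survive, so here is a minimal sketch of what such a configuration might look like. The key names and class names below are assumptions based on the commonly documented GCS connector settings, not the answer's confirmed values; check them against the connector version you use. The helper names `gcs_hadoop_conf` and `apply_hadoop_conf` are hypothetical.

```python
def gcs_hadoop_conf(project_id, keyfile_path):
    """Return Hadoop settings for the GCS connector.

    All key names here are assumptions (commonly documented connector
    properties), not values recovered from the answer above.
    """
    return {
        # Register the gs:// filesystem implementations.
        "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        # Project and service-account authentication.
        "fs.gs.project.id": project_id,
        "google.cloud.auth.service.account.enable": "true",
        "google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }


def apply_hadoop_conf(spark, settings):
    """Copy each setting into the session's Hadoop configuration."""
    hadoop_conf = spark._jsc.hadoopConfiguration()
    for key, value in settings.items():
        hadoop_conf.set(key, value)
```

With a real session this would be called as `apply_hadoop_conf(spark, gcs_hadoop_conf('MY-GCP-PROJECT-ID', '/path/to/keyfile.json'))` before the first read from a gs:// path.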
Upvotes: 2
Reputation: 41
Considering that it has been a while since the last answer, I thought I would share my recent solution. Note that the following instructions are for Spark 2.4.4.
Make sure that all the environment variables are properly set up for your Spark application to run. These are:
a. SPARK_HOME pointing to the location where you have saved your Spark installation.
b. GOOGLE_APPLICATION_CREDENTIALS pointing to the location of the JSON key file. If you have just downloaded it, it will be in your ~/Downloads folder.
c. JAVA_HOME pointing to the location where you have your Java 8* "Home" folder.
If you are on Linux/macOS, you can use export VAR=DIR, where VAR is the variable and DIR the location. To set them permanently, add them to your ~/.bash_profile or ~/.zshrc file.
For Windows users, in cmd write set VAR=DIR for the current shell session, or setx VAR DIR to store the variables permanently.
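Before starting Spark it can help to confirm that the three variables are actually visible to the process. The following is a small sketch of such a check; the function name `missing_env_vars` is hypothetical and the check only tests that each variable is set and non-empty, not that the paths are valid.

```python
import os

# The three variables named in the steps above.
REQUIRED_VARS = ("SPARK_HOME", "GOOGLE_APPLICATION_CREDENTIALS", "JAVA_HOME")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```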
That has worked for me and I hope it helps others too.
* Spark 2.4.x runs on Java 8, so some of its features might not be compatible with newer Java Development Kits.
Upvotes: 4
Reputation: 181
In my case on Spark 2.4.3 I needed to do the following to enable GCS access from Spark local. I used a JSON keyfile vs. the
proposed above.
In $SPARK_HOME/jars/, use the shaded gcs-connector jar; otherwise I had various failures with transitive dependencies.
(Optional) Add to build.sbt:
"" % "gcs-connector" % "hadoop2-1.9.17"
exclude("javax.jms", "jms")
exclude("com.sun.jdmk", "jmxtools")
exclude("com.sun.jmx", "jmxri")
In $SPARK_HOME/conf/spark-defaults.conf, add: true /path/to/my/keyfile
And everything is working.
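The property names for the spark-defaults.conf step are missing above. As a sketch only: Spark passes any `spark.hadoop.*` property through to the Hadoop configuration, so the two entries were probably of the following shape. The connector key names are assumptions, and `spark_defaults_lines` is a hypothetical helper that just renders the lines to paste into the file.

```python
def spark_defaults_lines(keyfile_path):
    """Render spark-defaults.conf lines for service-account auth.

    The connector key names after the 'spark.hadoop.' prefix are
    assumptions, not values recovered from the answer above.
    """
    settings = {
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile_path,
    }
    # spark-defaults.conf uses "key value" pairs separated by whitespace.
    return [f"{key} {value}" for key, value in settings.items()]


if __name__ == "__main__":
    print("\n".join(spark_defaults_lines("/path/to/my/keyfile")))
```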
Upvotes: 8
Reputation: 972
I am submitting here the solution I have come up with by combining different resources:
Download the Google Cloud Storage connector (gcs-connector) and store it in the $SPARK/jars/ folder (see Alternative 1 at the bottom).
Download the core-site.xml file from here, or copy it from below. This is a configuration file used by Hadoop (which is used by Spark).
Store the core-site.xml file in a folder. Personally, I create the $SPARK/conf/hadoop/conf/ folder and store it there.
In the file, indicate the Hadoop conf folder by adding the following line: export HADOOP_CONF_DIR=
Create an OAuth2 key from the respective page of Google (Google Console -> API Manager -> Credentials).
Copy the credentials to the core-site.xml file.
Alternative 1: Instead of copying the file to the $SPARK/jars folder, you can store the jar in any folder and add that folder to the Spark classpath. One way is to edit SPARK_CLASSPATH, but SPARK_CLASSPATH is now deprecated, so use another way of adding a jar to the Spark classpath.
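One non-deprecated route for Alternative 1 is the `--jars` option of spark-submit, which takes a comma-separated list of jar paths. The sketch below only assembles the argument list; `spark_submit_command` is a hypothetical helper, and the paths are placeholders.

```python
def spark_submit_command(app_path, jar_paths):
    """Build a spark-submit argument list that ships extra jars.

    --jars expects a single comma-separated string of jar paths.
    """
    return ["spark-submit", "--jars", ",".join(jar_paths), app_path]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own jar and script.
    cmd = spark_submit_command("my_job.py", ["/LOCATION-TO-JARS/gcs-connector.jar"])
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with paths that contain spaces.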
<description>Register GCS Hadoop filesystem</description>
<description>Force OAuth2 flow</description>
<description>Client id of Google-managed project associated with the Cloud SDK</description>
<description>Client secret of Google-managed project associated with the Cloud SDK</description>
<description>This value is required by GCS connector, but not used in the tools provided here.
The value provided is actually an invalid project id (starts with `_`).</description>
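Only the description elements of the core-site.xml file are left above. For reference, a Hadoop configuration file wraps each setting in a property element with name, value and description children. The sketch below builds such a file with the standard library; `build_core_site` is a hypothetical helper, and the single `fs.gs.impl` entry is an assumed example of a property name and value. The OAuth2 client id and secret entries are deliberately left out because their names are not recoverable from the text above.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Return core-site.xml content for (name, value, description) tuples."""
    root = ET.Element("configuration")
    for name, value, description in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        ET.SubElement(prop, "description").text = description
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed property name and class; verify against your connector version.
    print(build_core_site([
        ("fs.gs.impl",
         "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
         "Register GCS Hadoop filesystem"),
    ]))
```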
Upvotes: 5