pltrdy

Reputation: 2109

Jupyter & PySpark: How to run multiple notebooks

I am using Spark 1.6.0 on three VMs: one master (standalone mode) and two workers with 8 GB RAM and 2 CPUs each.

I am using the kernel configuration below:

{
 "display_name": "PySpark",
 "language": "python",
 "argv": [
  "/usr/bin/python3",
  "-m",
  "IPython.kernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "<mypath>/spark-1.6.0",
  "PYTHONSTARTUP": "<mypath>/spark-1.6.0/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "--master spark://<mymaster>:7077 --conf spark.executor.memory=2G --driver-class-path /opt/vertica/java/lib/vertica-jdbc.jar pyspark-shell"
 }
}
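
For reference, the spec is installed as a per-user kernel, roughly like this (the directory name pyspark is just my choice, nothing mandated by Jupyter):

# save the JSON above as kernel.json in a kernels directory,
# then check that Jupyter picks it up
mkdir -p ~/.local/share/jupyter/kernels/pyspark
cp kernel.json ~/.local/share/jupyter/kernels/pyspark/
jupyter kernelspec list    # 'pyspark' should appear in the output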

Currently, this works: I can use the Spark context sc and sqlContext without any imports, just as in the pyspark shell.
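
For example, a fresh notebook cell works without any imports:

print(sc.version)              # '1.6.0'
sqlContext.range(0, 4).show()  # trivial DataFrame, just to confirm the cluster answers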

The problem comes when I use multiple notebooks. On my Spark master I see two 'pyspark-shell' apps, which kind of makes sense, but only one can run at a time. And here 'running' does not mean executing anything: even when I do not run anything in a notebook, it is still shown as 'running'. Given this, I can't share my resources between notebooks, which is quite sad (I currently have to kill the first shell, i.e. the first notebook kernel, to run the second).
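
Side note, from what I understand of standalone mode (my reading, not verified): the first application claims every available core by default, so the second shell just waits for resources. Capping cores per application, e.g. adding --conf spark.cores.max=2 to PYSPARK_SUBMIT_ARGS, should at least let two apps coexist. From a running notebook I can check what was granted:

print(sc.master)               # spark://<mymaster>:7077
print(sc.defaultParallelism)   # roughly the total cores granted to this app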

If you have any ideas about how to do this, let me know! Also, I'm not sure whether the way I'm working with kernels is best practice; I already had trouble just getting Spark and Jupyter to work together.

Thx all

Upvotes: 8

Views: 2464

Answers (1)

pcc

Reputation: 61

The problem is the database Spark uses for its metastore, which is Derby by default. Derby is a lightweight database system that allows only one connection at a time, so only one Spark instance can use the metastore at once. The solution is to set up a database system that can handle multiple concurrent instances (postgres, mysql, ...).
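
You can see the limitation directly. A sketch, assuming both notebooks use the default embedded metastore from the same working directory: the first HiveContext boots Derby and takes an exclusive lock on the local metastore_db directory, and the second cannot boot it.

from pyspark.sql import HiveContext

hc = HiveContext(sc)   # notebook 1: creates ./metastore_db and locks it
hc.sql("show tables")  # the same two lines in notebook 2 fail with a Derby boot error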

For example, you can use a postgres database:

  • Add the postgres JDBC jar to spark/jars
  • Add a config file (hive-site.xml) to spark's conf directory
  • Install postgres on your machine
  • Create a user, a password and a database for spark/hive in postgres (these must match the values in your hive-site.xml)

Example in a Linux shell:

# download the postgres JDBC jar and put it where Spark can find it (step 1)
wget https://jdbc.postgresql.org/download/postgresql-42.1.4.jar
cp postgresql-42.1.4.jar $SPARK_HOME/jars/

# install the postgres server on your machine (e.g. on Debian/Ubuntu)
sudo apt-get install postgresql

# add the user, password and db to postgres (run as the postgres superuser)
sudo -u postgres psql -d postgres -c "create user hive"
sudo -u postgres psql -d postgres -c "alter user hive with password 'pass'"
sudo -u postgres psql -d postgres -c "create database hive_metastore"
sudo -u postgres psql -d postgres -c "grant all privileges on database hive_metastore to hive"

hive-site.xml:

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>pass</value>
  </property>

</configuration>
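
After restarting the notebook kernels, both pyspark-shell apps can use the metastore at the same time. A quick check from either notebook (in the pyspark shell, sqlContext is a HiveContext when the Hive classes are available):

print(type(sqlContext))               # pyspark.sql.context.HiveContext
sqlContext.sql("show tables").show()  # now served by the postgres-backed metastore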

Upvotes: 1
