RobbieTheK

Reputation: 198

How do I set up a shared Spark installation for multiple users (by default, Derby's db.lck prevents other users from opening the metastore)?

We'd like students to be able to start spark-shell or pyspark under their own accounts. However, the first user to start a session locks the Derby metastore database, which blocks everyone else:

-rw-r--r-- 1 myuser staff   38 Jun 28 10:40 db.lck
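Derby holds this lock for the lifetime of the first session. To see which process is holding it (a sketch; on Linux, lsof can usually identify the holder — the path below is an example, adjust it to wherever metastore_db lives):

# list processes with files open under the Derby database directory
lsof +D /path/to/metastore_db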

When a second user tries to start a session, these errors appear:

ERROR PoolWatchThread: Error in trying to obtain a connection. Retrying in 7000ms
java.sql.SQLException: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.setReadOnly(Unknown Source)
    at com.jolbox.bonecp.ConnectionHandle.setReadOnly(ConnectionHandle.java:1324)
    at com.jolbox.bonecp.ConnectionHandle.<init>(ConnectionHandle.java:262)
    at com.jolbox.bonecp.PoolWatchThread.fillConnections(PoolWatchThread.java:115)
    at com.jolbox.bonecp.PoolWatchThread.run(PoolWatchThread.java:82)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: ERROR 25505: A read-only user or a user in a read-only database is not permitted to disable read-only mode on a connection.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.sql.conn.GenericAuthorizer.setReadOnlyConnection(Unknown Source)
    at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.setReadOnly(Unknown Source)

Is there a workaround or best practice for this scenario?

I then tried to configure MySQL using these instructions, but this happens:

[Fatal Error] hive-site.xml:7:2: The markup in the document following the root element must be well-formed.
17/06/28 12:14:13 ERROR Configuration: error parsing conf file:/usr/local/bin/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml
org.xml.sax.SAXParseException; systemId: file:/usr/local/bin/spark-2.1.1-bin-hadoop2.7/conf/hive-site.xml; lineNumber: 7; columnNumber: 2; The markup in the document following the root element must be well-formed.
    ... 74 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^

And here are the contents of the XML file:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>ourpassword</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ourip:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
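(The parser error above is because the first <property> is being taken as the document's root element, so everything after it is "markup following the root element". A well-formed hive-site.xml wraps all of the <property> blocks in a single <configuration> element — a minimal skeleton, with the remaining properties abbreviated:)

<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore</value>
  </property>
  <!-- ...the remaining <property> blocks from above... -->
</configuration>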

Edit: after adding the opening and closing <configuration> tags, I get this:

17/06/28 12:28:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/28 12:28:52 WARN metastore: Failed to connect to the MetaStore Server...
17/06/28 12:28:53 WARN metastore: Failed to connect to the MetaStore Server...
17/06/28 12:28:54 WARN metastore: Failed to connect to the MetaStore Server...
17/06/28 12:28:55 WARN Hive: Failed to access metastore. This class should not accessed in runtime.
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1236)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:466)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
  ... 96 more
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
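The repeated "Failed to connect to the MetaStore Server" warnings suggest nothing is answering on the thrift port configured in hive.metastore.uris. A quick check (host and port as configured above; the second command assumes a standalone Hive installation on that host):

# does anything listen on the metastore port?
nc -vz ourip 9083

# if not, the standalone metastore service needs to be running there:
hive --service metastore &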

Upvotes: 0

Views: 2000

Answers (2)

RobbieTheK

Reputation: 198

Thank you, Jacek, for the suggestions. I was able to move the metastore from Derby to MySQL. I have to start spark-shell with the --jars /usr/share/java/mysql-connector-java.jar option, though. Is there a way to bake that option into the spark-shell script?
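One way that seems to work is to set the jar once in conf/spark-defaults.conf rather than passing --jars on every launch (jar path as above; spark.driver.extraClassPath would be an alternative):

# $SPARK_HOME/conf/spark-defaults.conf
spark.jars  /usr/share/java/mysql-connector-java.jar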

I tested it on another workstation, and PostgreSQL, following this tip, seems to work nicely as well. It was a little tricky on Fedora, but once I ran the correct init command and configured pg_hba.conf, it didn't need the --jars option at all.
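For the Fedora part, a sketch of what the setup amounts to, assuming the distribution packages and a local-only metastore (verify the commands against your release):

sudo dnf install postgresql-server postgresql-jdbc
sudo postgresql-setup --initdb           # the init step Fedora requires before first start
sudo systemctl enable --now postgresql

# then in /var/lib/pgsql/data/pg_hba.conf, allow the metastore role with password auth:
host  metastore  hive  127.0.0.1/32  md5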

Upvotes: 0

Jacek Laskowski

Reputation: 74729

Is there a work around or best practice for this scenario?

Yes. Let the students work with their own Spark installation (don't use a shared installation as it buys you nothing).

After all, Spark is just a library for developing distributed data processing applications, and what you're facing is an issue with spark-shell, a tool that helps people get started with Spark on the command line.

The reason for the issue is that spark-shell (and Spark by default) uses a Derby database for the catalog and the Hive metastore, and Derby allows only a single user at a time. Setting it up differently would take much more effort than just giving each user a separate Spark installation.
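A per-user installation is just a download and an unpack in each student's home directory; a sketch using the same version as above:

cd ~
wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
tar xzf spark-2.1.1-bin-hadoop2.7.tgz
~/spark-2.1.1-bin-hadoop2.7/bin/spark-shell
# Derby's metastore_db is created in the launch directory, so each user gets their own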

A side note: have you considered using Databricks Cloud, so the students don't have to care about the command line at all?

Upvotes: 1
