Reputation: 185
I have some performance issues and a few questions for you :) I created a Scala application. This application computes some statistics live (such as session counts) from a Cassandra database. I used Spray as the HTTP framework to build my API, and Spark to compute and map-reduce the results from Cassandra. I deploy the application to Spark with spark-submit.
Do you think it is the best approach to develop the whole application directly inside Spark? Or should I create one (HTTP) application outside of Spark, and have it call another application that only computes the data from Cassandra with Spark?
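For context, here is a simplified sketch of how my current setup is wired; the route and helper names are illustrative, not my real code:

    import akka.actor.ActorSystem
    import org.apache.spark.{SparkConf, SparkContext}
    import spray.routing.SimpleRoutingApp

    object StatsApi extends App with SimpleRoutingApp {
      implicit val system = ActorSystem("stats-api")

      // One long-lived SparkContext, shared by all HTTP requests.
      val sc = new SparkContext(new SparkConf().setAppName("stats-api"))

      startServer(interface = "0.0.0.0", port = 8080) {
        path("sessions") {
          get {
            complete {
              computeSessionStats(sc) // the Spark + Cassandra map/reduce lives here
            }
          }
        }
      }

      // Placeholder for the real computation against Cassandra.
      def computeSessionStats(sc: SparkContext): String = ???
    }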
I have 3 servers for my tests (one with 32 GB RAM and 8 cores, one with 64 GB and 8 cores, and the last one with 64 GB and 12 cores). I know it would be better to have identical servers in the cluster, but I can't change that for the moment. I use standalone mode. My configuration in spark-defaults.conf:
spark.deploy.defaultCores=28
spark.executor.memory=30G
And for the moment it is slow: it takes 9 seconds, with 3 Spark jobs, just to get a result like this:
{"result":"success","list":[{"item":"1474236000","value":6},{"item":"1474239600","value":3},{"item":"1474243200","value":3},{"item":"1474246800","value":3},{"item":"1474250400","value":3},{"item":"1474254000","value":8},{"item":"1474257600","value":4},{"item":"1474261200","value":11},{"item":"1474264800","value":1},{"item":"1474268400","value":3},{"item":"1474272000","value":18},{"item":"1474275600","value":6},{"item":"1474279200","value":4},{"item":"1474282800","value":2},{"item":"1474286400","value":2},{"item":"1474293600","value":4},{"item":"1474297200","value":10},{"item":"1474300800","value":10},{"item":"1474304400","value":8},{"item":"1474308000","value":6},{"item":"1474311600","value":8},{"item":"1474315200","value":4},{"item":"1474318800","value":4},{"item":"1474322400","value":6}],"nb_session":137.0,"old_nb_session":161}
Do you have any suggestions for me? I don't understand why it's so slow :(
Thanks a lot
Upvotes: 3
Views: 89
Reputation: 1606
I would advise you to work directly with Cassandra and CQL. If you cannot express everything in CQL, you can always create a user-defined function (UDF).
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateUDF.html
By default, Cassandra 2.2 and later supports defining functions in Java and JavaScript. Other scripting languages, such as Python, Ruby, and Scala, can be added by adding a JAR to the classpath. Install the JAR file into $CASSANDRA_HOME/lib/jsr223/[language]/[jar-name].jar, where language is 'jruby', 'jython', or 'scala'.
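To make that concrete, here is a minimal sketch of the direct-CQL approach using the DataStax Java driver from Scala. The keyspace, table, and column names are assumptions, since I don't know your schema:

    import com.datastax.driver.core.Cluster
    import scala.collection.JavaConverters._

    object DirectCql extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("stats") // assumed keyspace

      // A Java UDF that buckets a timestamp into its hour (epoch seconds).
      // Requires enable_user_defined_functions: true in cassandra.yaml.
      session.execute(
        """CREATE FUNCTION IF NOT EXISTS hour_bucket(ts timestamp)
          |  RETURNS NULL ON NULL INPUT
          |  RETURNS bigint
          |  LANGUAGE java
          |  AS 'return (ts.getTime() / 3600000L) * 3600L;'""".stripMargin)

      // Use it in a plain query instead of a Spark round-trip.
      // Assumes `day` is the (text) partition key of the sessions table.
      val rows = session.execute(
        "SELECT hour_bucket(ts) FROM sessions WHERE day = '2016-09-19'").all().asScala
      rows.groupBy(_.getLong(0)).foreach { case (h, rs) => println(s"$h -> ${rs.size}") }

      cluster.close()
    }

If a built-in function is enough for your statistics, you can skip the UDF entirely; the point is to avoid launching a Spark job just to produce one small result.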
One option for a low-latency Apache Spark solution would be to keep the data cached in Spark across multiple requests and query only the cached data on each request (skipping the loading-from-Cassandra part). This is non-trivial.
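A rough sketch of that idea using the spark-cassandra-connector (keyspace, table, and column names are again assumptions):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object CachedStats {
      // One long-lived context; the table is loaded and cached once at startup.
      val sc = new SparkContext(
        new SparkConf()
          .setAppName("cached-stats")
          .set("spark.cassandra.connection.host", "127.0.0.1"))

      // Assumed schema: stats.sessions with a bigint hour_bucket column.
      val sessionsByHour = sc.cassandraTable("stats", "sessions")
        .map(row => (row.getLong("hour_bucket"), 1))
        .cache()

      // Called on every HTTP request: works against the cached RDD only,
      // skipping the load-from-Cassandra step.
      def perHourCounts(): Map[Long, Int] =
        sessionsByHour.reduceByKey(_ + _).collect().toMap
    }

The non-trivial part is keeping the cached RDD in sync as new sessions are written to Cassandra, e.g. by reloading and re-caching on a schedule.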
Upvotes: 1