sbh

Reputation: 49

Spark scala coding standards

I am reaching out to the community to understand the impact of coding in certain ways in Scala for Spark. I received some review comments that I feel need discussion. Coming from a traditional Java and OOP background, I am writing my opinion and questions here. I would appreciate it if you could chime in with your wisdom. I am in a Spark 1.3.0 environment.

1. Use of for loops: Is it against the rules to use for loops?

There are distributed data structures like RDDs and DataFrames in Spark. We should not be collecting them and using for loops on them, as the computation would then happen on the driver node alone. This will have adverse effects, especially if the data is large.

But if I have a utility Map that stores parameters for the job, it is fine to use a for loop on it if desired. Using a for loop or a map over such an iterable is a coding choice. It is important to understand that this map is different from map on a distributed data structure: it will still run on the driver node alone.
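To illustrate the distinction, here is a small sketch; it assumes an existing SparkContext named sc, and the map contents and paths are made up for illustration only:

    // Sketch only: `sc` is an existing SparkContext; values are illustrative.
    val params: Map[String, String] = Map("inputPath" -> "/data/in", "mode" -> "full")

    // Looping over a small driver-side map is harmless; it runs on the driver only.
    for ((key, value) <- params) {
      println(s"$key = $value")
    }

    // For distributed data, keep the work on the executors instead of collecting:
    val bigRdd = sc.textFile(params("inputPath"))
    val lineLengths = bigRdd.map(_.length)   // runs in parallel on the cluster

    // Anti-pattern for large data: bigRdd.collect() pulls everything to the
    // driver before the loop even starts.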

2. Use of var vs val

val is an immutable reference to an object and var is a mutable reference. In the example below

    val driverDf = {
      var df = dataLoader.loadDriverInput()
      df = df.sqlContext.createDataFrame(df.rdd, df.schema)
      df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    }

Even though we have used var for df inside the block, driverDf itself is an immutable reference to the resulting data frame. This kind of use of var is perfectly fine.

Similarly the following is also fine.

    var driverDf = dataLoader.loadDriverInput()
    driverDf = applyTransformations(driverDf)

    def applyTransformations(driverDf: DataFrame) = { ... }

Are there any generic rules that say vars cannot be used in a Spark environment?
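For reference, the same flow can also be written with vals only. This is just a sketch reusing the hypothetical dataLoader and applyTransformations from the snippet above, not a claim that it is required:

    import org.apache.spark.sql.DataFrame

    // val-only version of the var-based snippet above (names reused from it).
    val rawDf: DataFrame = dataLoader.loadDriverInput()
    val transformedDf: DataFrame = applyTransformations(rawDf)

    // Several transformations can also be chained without vars:
    val steps: Seq[DataFrame => DataFrame] = Seq(applyTransformations)
    val resultDf: DataFrame = steps.foldLeft(rawDf)((df, step) => step(df))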

3. Use of if-else vs match/case, and not throwing exceptions

Is it against standard practice not to throw exceptions, or not to use if-else?
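For concreteness, here is a small sketch of the two styles side by side; the mode parameter and the paths are hypothetical:

    // Hypothetical job parameter, only for illustration.
    val mode = "incremental"

    // if-else style
    val inputPath =
      if (mode == "full") "/data/full"
      else if (mode == "incremental") "/data/delta"
      else "/data/default"

    // match/case style, throwing an exception for an unexpected value
    val inputPath2 = mode match {
      case "full"        => "/data/full"
      case "incremental" => "/data/delta"
      case other         => throw new IllegalArgumentException("Unknown mode: " + other)
    }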

4. Use of hive context vs sql context

Are there any performance implications of using SQLContext vs HiveContext (I know HiveContext extends SQLContext) for the underlying Hive tables?

Is it against standards to create multiple HiveContexts in the program? My job iterates through a part of a whole data frame of values each time. The whole data frame is cached in one hive context. Each iteration's data frame is created from the whole data using a new hive context and cached. This cache is purged at the end of the iteration. This approach gave me performance improvements in Spark 1.3.0. Is this approach breaking any standards?
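For reference, here is a simplified sketch of the caching and purging part of this pattern against Spark 1.3.x. It uses a single HiveContext for clarity, and the table and column names (events, batch_id) are made up:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.storage.StorageLevel

    // `sc` is an existing SparkContext; table and column names are illustrative.
    val hiveContext = new HiveContext(sc)
    val wholeDf = hiveContext.sql("SELECT * FROM events")
    wholeDf.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val batchIds = wholeDf.select("batch_id").distinct.collect().map(_.getInt(0))

    for (batchId <- batchIds) {
      val iterationDf = wholeDf.filter(wholeDf("batch_id") === batchId)
      iterationDf.persist(StorageLevel.MEMORY_AND_DISK_SER)

      // ... per-iteration transformations go here ...

      iterationDf.unpersist()   // purge the iteration cache before the next round
    }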

I appreciate the responses.

Upvotes: 1

Views: 1913

Answers (1)

Ramzy

Reputation: 7148

1. Regarding loops, as you correctly mentioned, you should prefer RDD map to perform operations in parallel across multiple nodes. For smaller iterables, you can go with a for loop. Again, it comes down to the driver memory and the time it takes to iterate.

For smaller sets of around 100 elements, handling them in a distributed way will incur unnecessary network overhead rather than giving a performance boost.

2. val vs var is a choice at the Scala level rather than a Spark one. I have never heard of such a rule; it depends on your requirement.

3. Not sure exactly what you asked. The only major downside of if-else is that it can become cumbersome when handling nested if-else. Apart from that, all should be fine. An exception can be thrown based on a condition; I see that as one of many ways to handle issues in an otherwise happy-path flow.

As mentioned here, the compiler generates more bytecode for match/case than for a simple if. So it is a trade-off between a simple condition check and code readability for complex condition checks.

4. HiveContext gives the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. Please note that in Spark 2.0, both HiveContext and SQLContext are replaced by SparkSession, as sketched below.
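For completeness, a minimal sketch of that Spark 2.0+ entry point; the app name and table name are placeholders:

    import org.apache.spark.sql.SparkSession

    // Spark 2.0+ unified entry point; replaces SQLContext and HiveContext.
    val spark = SparkSession.builder()
      .appName("example-job")
      .enableHiveSupport()   // needed for HiveQL, Hive UDFs and Hive tables
      .getOrCreate()

    val df = spark.sql("SELECT * FROM some_hive_table")   // placeholder table name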

Upvotes: 1
