daxu

Reputation: 4084

Is this the right Apache Spark usage scenario?

I have been a SQL and C# developer and have just gotten into the world of Spark and Hadoop. This is the scenario of my daily work:

  1. We have some giant tables which contain share and fund price data.
  2. To get the performance or statistics for a share or fund, we have to retrieve the historical data for each instrument and do the mathematical calculations.
  3. We currently do the calculations in a multi-threaded way in C# (i.e., in our C# code we create multiple threads to load data from the database and do the calculations).

With my extremely limited experience of Spark and Hadoop, this is my feeling about the changes needed if we move from C# to Spark:

  1. I need to convert all calculations to Python.
  2. I need to load the SQL data into Hadoop.
  3. Spark will be in charge of running my functions, and I don't need to write multithreaded code any more.
  4. As Spark loads data into memory and does parallel computation, it will be much faster than the C# approach.

Is my understanding right?

A couple of things I don't really know and hope someone can shed some light on:

  1. How does Spark know which functions can be split and run in parallel?
  2. Do I need to write my code in a way that it can run in parallel (e.g. splitting data loading and calculation)?

For example, say I have this function in C#:

void CalculatePerformance(string code, DateTime start, DateTime end)
{
    var historyData = LoadHistory(code, start, end);
    CalculatePerformance(historyData);
}

I can easily rewrite this in Python, but how will Spark work internally to make the calculation much faster? Will it create lots of threads or something like that?
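For reference, here is a rough PySpark sketch of what I imagine the Spark version might look like (load_history and compute_stats are just placeholders for my own logic, and the instrument codes are made up):

from pyspark import SparkContext

sc = SparkContext(appName="PerformanceCalc")

# Hypothetical list of instrument codes; in reality this would come from a table.
codes = ["FUND001", "FUND002", "SHARE003"]

def calculate_performance(code):
    # load_history / compute_stats stand in for my own data loading and maths.
    history = load_history(code)
    return code, compute_stats(history)

# Spark splits the codes across partitions and runs the function on each
# partition in parallel, on however many executors the cluster provides.
results = sc.parallelize(codes).map(calculate_performance).collect()

Is this roughly the way it is supposed to be written, or does Spark parallelize things differently?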

Upvotes: 0

Views: 69

Answers (2)

Thomas Decaux

Reputation: 22691

Be very careful: Spark is not magic at all, and it needs a lot of work to be "controlled" ;-)

You must understand the RDD / partition / job / DAG concepts first, and with Hadoop you will run Spark on YARN, which is another big subject to learn.

Spark does not simply load data into memory: there are input "plugins" (which determine the number of partitions), and you decide whether to cache data in memory, on disk, or not at all.
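For example, a minimal PySpark sketch of that choice (the HDFS path is just an example):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CachingChoice")

# The input source decides how many partitions the RDD has;
# nothing is pulled into memory just by creating it.
prices = sc.textFile("hdfs:///data/prices.csv")  # example path

# Caching is an explicit decision, not something Spark does for you:
prices.persist(StorageLevel.MEMORY_ONLY)    # keep partitions in executor memory
# prices.persist(StorageLevel.DISK_ONLY)    # or spill them to local disk
# ...or persist nothing, and the file is re-read on every action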

Welcome to Java: you must deal with the garbage collector and make sure there is enough memory to fit your data AND to run your code. With a lot of data and nodes, this is very tricky to tune.
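A rough idea of where that tuning happens, with made-up numbers:

from pyspark import SparkConf, SparkContext

# The values below are only illustrative; the right settings depend entirely
# on your data volume, node sizes and what your code does.
conf = (SparkConf()
        .setAppName("TunedJob")
        .set("spark.executor.memory", "8g")      # heap available to each executor
        .set("spark.executor.cores", "4")        # tasks run in parallel per executor
        .set("spark.memory.fraction", "0.6"))    # share of heap for execution/storage
sc = SparkContext(conf=conf)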

To be honest, if this can work in C#, stay with it! And never trust the word count examples you read that look awesome on Spark.

Upvotes: 0

Mehrez

Reputation: 695

I can summarize a Spark job in four main steps:

1- Initialize Spark just to create your SparkContext; there is no parallelism here.

2- Load data with sparkContext.textFile(path); there are many ways to load your data depending on its type (JSON, CSV, Parquet, ...). This operation creates an RDD[T] (a Dataset or DataFrame in other use cases).

3- Apply transformations to the RDD[T] (map, filter, ...); for example, you can transform your RDD[T] into an RDD[X] with your custom functions (T => X).

4- Call an action (collect, reduce, ...).

So when you call an action at the end, each Spark executor will create many threads (depending on how the data is partitioned) to load => transform => act, as in the sketch below.
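A minimal PySpark sketch of these four steps (the HDFS path and CSV layout are just assumptions):

from pyspark import SparkContext

# 1- initialize Spark: only creates the SparkContext, no parallelism yet
sc = SparkContext(appName="FourSteps")

# 2- load data: creates an RDD[String], one element per line of the file
lines = sc.textFile("hdfs:///data/prices.csv")

# 3- transformations: lazy, nothing is executed yet
pairs = (lines
         .map(lambda line: line.split(","))             # String => list of columns
         .map(lambda cols: (cols[0], float(cols[1]))))  # columns => (code, price)

# 4- action: triggers the job, executors process their partitions in parallel
total_per_code = pairs.reduceByKey(lambda a, b: a + b).collect()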

Sorry for this quick explanation; there are many other things in Spark (stages, shuffle, cache, ...), but I hope this helps you understand the basics.

Upvotes: 1
