Reputation: 4084
I have been a SQL and C# developer and just got into the world of Spark and Hadoop. This is the scenario of my daily work:
To get the performance or statistics for a share or fund, we have to retrieve the history data for each instrument and do the maths calculation.
We do the calculation in a multi-threaded way in C# (i.e. in our C# code, we create multiple threads to load data from the database and do the calculation).
With my extremely limited experience of Spark and Hadoop, this is my feeling about the changes needed if we move from C# to Spark:
Is my understanding right?
A couple of things I don't really know and hope someone can shed some light on:
For example, say I have this function in C#:
void CalculatePerformance(string code, DateTime start, DateTime end)
{
    var historyData = LoadHistory(code, start, end);
    CalculatePerformance(historyData);
}
I can easily rewrite this in Python, but how will Spark work internally to make the calculation much faster? Is it that Spark will create lots of threads or something?
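If it helps, this is roughly how I imagine the rewrite in PySpark (load_history and calculate_performance below are just stubs standing in for my real C# logic, and the codes and dates are made up; I am not sure this is the right way to use Spark):

from datetime import datetime
from pyspark import SparkContext

sc = SparkContext(appName="PerformanceCalc")

# made-up inputs; in reality the instrument codes come from our database
codes = ["FUND001", "FUND002", "FUND003"]
start, end = datetime(2018, 1, 1), datetime(2018, 12, 31)

def load_history(code, start, end):
    return []          # stub for the real history load

def calculate_performance(history):
    return 0.0         # stub for the real maths calculation

def calc(code):
    return code, calculate_performance(load_history(code, start, end))

# parallelize() spreads the codes over the cluster, map() runs calc for each code,
# collect() brings the results back to the driver
results = sc.parallelize(codes).map(calc).collect()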
Upvotes: 0
Views: 69
Reputation: 22691
Be very careful, Spark is not magic at all, and needs a lot of work to be "controlled" ;-)
You must understand the RDD / partition / job / DAG concepts first, and with Hadoop you will run Spark on YARN, which is another big subject to learn.
Spark does not simply load your data into memory: there are input "plugins" (which determine the number of partitions), and you decide whether to cache data in memory, on disk, or not at all.
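For example (a minimal PySpark sketch, the path is made up), caching is a decision you take explicitly per RDD:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CacheDemo")

# textFile() only describes the input and its partitions; nothing is read yet
rdd = sc.textFile("hdfs:///data/prices.csv")

# YOU decide where the data lives after it is first computed (or do not persist at all)
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # or MEMORY_ONLY, DISK_ONLY, ...

print(rdd.count())   # the first action actually reads the file and fills the cache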
Welcome to Java: you must deal with the garbage collector and make sure there is enough memory to fit your data AND to run your code. With lots of data and nodes, this is very tricky to tune.
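For example (the sizes are made up, and property names can vary between Spark versions), memory is something you size yourself when you create the context or submit the job:

from pyspark import SparkConf, SparkContext

# made-up sizes: the executor heap must hold your cached data AND leave room for execution
conf = (SparkConf()
        .setAppName("MemoryTuning")
        .set("spark.executor.memory", "8g")           # heap per executor
        .set("spark.executor.memoryOverhead", "2g")   # off-heap headroom on YARN
        .set("spark.memory.fraction", "0.6"))         # share of heap for execution + storage

sc = SparkContext(conf=conf)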
To be honest, if this can work with C#, stay with it! And never trust the word count examples you read that look awesome on Spark.
Upvotes: 0
Reputation: 695
I can summarize a Spark job in four main steps:
1- Initialize Spark just to create your SparkContext; no parallelism here
2- Load data with sparkContext.textFile(path); there are many ways to load your data depending on its type (json, csv, parquet, ...): this operation creates an RDD[T] (a Dataset or DataFrame in other use cases)
3- Make a transformation on the RDD[T] (map, filter, ...); for example you can transform your RDD[T] into an RDD[X] with your custom functions (T => X)
4- Make an action (collect, reduce, ...)
So if you call an action at the end, each Spark executor will create many threads (depending on how the data is partitioned) to load => transform => action.
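Here is a tiny PySpark sketch of those four steps (the path and the csv layout are made up):

from pyspark import SparkContext

# 1- initialize Spark: just creates the SparkContext, no parallelism here
sc = SparkContext(appName="FourSteps")

# 2- load: describes the input and its partitions (lazy)
lines = sc.textFile("hdfs:///data/prices.csv")               # RDD[str]

# 3- transform: T => X with your custom functions (still lazy)
prices = lines.map(lambda line: float(line.split(",")[1]))   # RDD[float]
positive = prices.filter(lambda p: p > 0)

# 4- action: triggers the job; executors run load => transform => action in parallel
total = positive.reduce(lambda a, b: a + b)
print(total)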
Sorry for this quick explanation; there are many other things in Spark (stages, shuffle, cache, ...), but I hope this allows you to understand the beginning.
Upvotes: 1