amormachine

Reputation: 405

Data Movement Within the Hadoop / Spark Ecosystem

I have a basic question which I was hoping to better understand:

Background

Suppose I have a huge CSV file (50 GB) that I would like to make available for analysis across a data science team. Ideally, each member of the team would be able to interact with the data in the language of their choice, the data wouldn't need to move frequently (given its size) and all would have flexible access to computational resources.

Proposed Solution

Apache Spark appears to be the current front-runner among solutions that meet the above requirements. Scala, Python, SQL, and R can all access the data where it sits, atop flexible computational resources (if leveraging a cloud provider such as Databricks, Azure, AWS, or Cloudera).

Question

Take a specific example in the Microsoft Azure / HDInsight domain. Suppose we were to upload this large CSV to Azure Data Lake. If we then leverage Spark within HDInsight to define a schema for this data, will we need to move / import the data from where it resides?

My understanding, which may be mistaken, is that a key benefit is that the data can reside in its native CSV format in the Data Lake; running computations on it does not require it to be moved. Furthermore, if we wished to frequently bring Spark clusters down and back up on an as-needed basis, we could do so by simply re-pointing them at the cheaply stored CSVs.
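
To make the question concrete, here is a minimal PySpark sketch of the kind of in-place access I have in mind (the adl:// path, schema, and column names are placeholders, not real resources):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("csv-in-place").getOrCreate()

    # Declaring the schema up front avoids a full inference pass over the 50 GB file.
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("event_time", StringType(), True),
        StructField("value", DoubleType(), True),
    ])

    df = (spark.read
          .schema(schema)
          .option("header", "true")
          .csv("adl://myaccount.azuredatalakestore.net/data/huge_file.csv"))

    # The DataFrame is now queryable from SQL, Python, Scala, or R sessions
    # on the cluster without copying the underlying file anywhere.
    df.createOrReplaceTempView("events")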

Conclusion

Any confirmation you are able to provide regarding the above, or clarifications regarding misunderstandings, would be much appreciated. The Hadoop / Spark ecosystem continues to evolve rapidly, and I'd like to ensure I have a correct understanding as to its current capabilities.

Upvotes: 2

Views: 464

Answers (2)

Pramod Sripada

Reputation: 261

Two points to note:

  1. Efficient storage using Parquet: It is better to store the data in Parquet format rather than CSV, because it saves a lot of space, and Spark with Parquet (thanks to its columnar format) gives better query performance through predicate pushdown. Parquet can shrink your files by up to about 60% (see the sketch after this list).
  2. Data locality (data resides on the executor machines): If you create your cluster on Azure and store the data in Azure Data Lake, there will be some data movement from the Data Lake to the executors, unless the data is local to the executors.
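
A minimal PySpark sketch of the one-time CSV-to-Parquet conversion from point 1 (the adl:// paths and column names are placeholders, assuming a Data Lake store attached to the cluster):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    csv_path = "adl://myaccount.azuredatalakestore.net/data/huge_file.csv"
    parquet_path = "adl://myaccount.azuredatalakestore.net/data/huge_file_parquet"

    # One-time conversion: read the CSV, write it back out as compressed, columnar Parquet.
    df = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_path)
    df.write.mode("overwrite").parquet(parquet_path)

    # Later queries scan only the columns and row groups they need
    # (column pruning + predicate pushdown).
    spark.read.parquet(parquet_path).where("value > 100").select("id", "value").show()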

Hope it answers your question.

Upvotes: 1

devlace

Reputation: 331

Short answer is yes, the file can remain in Azure Data Lake Store. You can simply add your Data Lake Store as an additional storage account to your Spark HDInsight cluster, or even make it your default storage account when provisioning the cluster. This gives all your Spark jobs access to the data files residing in your storage account(s).

Please see here for more information: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#using-azure-data-lake-store-with-hdinsight-clusters

Note that if you choose to tear down your HDInsight cluster and you are using Hive in conjunction with Spark for schema/table persistence, make sure you are using an external database to host your metastore.

Please see here for more information on external metastores: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#a-nameuse-hiveoozie-metastoreahive-metastore
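
As a rough illustration of why the external metastore matters, one way to keep only the schema/metadata in the metastore while the CSV stays put in the Data Lake is an external table (the table name, columns, and adl:// path below are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("external-table")
             .enableHiveSupport()   # uses the cluster's Hive metastore; make it external
             .getOrCreate())        # so the table definition outlives the cluster

    # Only metadata is written to the metastore; the CSV files stay in the Data Lake.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events (
            id STRING,
            event_time STRING,
            value DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION 'adl://myaccount.azuredatalakestore.net/data/huge_csv_dir/'
    """)

    # A freshly provisioned cluster attached to the same metastore and storage
    # can query the table immediately.
    spark.sql("SELECT COUNT(*) FROM events").show()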

Upvotes: 1
