Reputation: 405
I have a basic question that I'm hoping to understand better:
Background
Suppose I have a huge CSV file (50 GB) that I would like to make available for analysis across a data science team. Ideally, each member of the team could interact with the data in the language of their choice, the data wouldn't need to move frequently (given its size), and everyone would have flexible access to computational resources.
Proposed Solution
Apache Spark appears to be the current front-runner among solutions that meet the above requirements. Scala, Python, SQL, and R can all access the data where it sits, atop flexible computational resources (if leveraging a cloud provider such as Databricks, Azure, AWS, or Cloudera).
Question
Take a specific example in the Microsoft Azure / HDInsight domain. Suppose we were to upload this large CSV to Azure Data Lake. If we then leverage Spark within HDInsight to define a schema for this data, will we need to move / import the data from where it resides?
My understanding, which may be mistaken, is that a key benefit is the data being able to reside in its native CSV format in the Data Lake. Running computations on it does not require it to be moved. Furthermore, if we wished to frequently bring down / bring up Spark clusters on an as-needed basis, we could do so by simply re-pointing them to the cheaply stored CSVs.
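To make the question concrete, here is a minimal PySpark sketch of what I have in mind, assuming the CSV sits in a Data Lake Store account attached to the cluster (the adl:// path and column names are invented purely for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-in-place").getOrCreate()

# Hypothetical schema for the 50 GB CSV -- column names are made up
schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("amount", DoubleType(), True),
])

# Read the CSV where it resides in the Data Lake; no copy/import step
df = spark.read.csv(
    "adl://mydatalake.azuredatalakestore.net/data/huge_file.csv",
    schema=schema,
    header=True,
)

df.groupBy("customer_id").sum("amount").show()
```

Is this roughly how it is intended to work, i.e. the cluster reads the file in place rather than ingesting it?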
Conclusion
Any confirmation you are able to provide regarding the above, or clarifications regarding misunderstandings, would be much appreciated. The Hadoop / Spark ecosystem continues to evolve rapidly, and I'd like to ensure I have a correct understanding as to its current capabilities.
Upvotes: 2
Views: 464
Reputation: 261
2 Points to note:
Hope this answers your question.
Upvotes: 1
Reputation: 331
Short answer is yes, the file can remain in Azure Data Lake Store. You can simply add your Data Lake Store as an additional storage account to your Spark HDInsight cluster, or even make it your default storage account when provisioning the cluster. This gives all your Spark jobs access to the data files residing in your storage account(s).
Please see here for more information: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-blob-storage#using-azure-data-lake-store-with-hdinsight-clusters
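Once the Data Lake Store is attached, jobs just reference it by URI. A minimal PySpark sketch, assuming an account named "mydatalake" and a hypothetical file path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Additional storage account: reference the data by its full adl:// URI
df = spark.read.csv(
    "adl://mydatalake.azuredatalakestore.net/data/huge_file.csv",
    header=True,
    inferSchema=True,
)

# If the Data Lake Store is the cluster's *default* storage account,
# a path relative to it works as well
df_default = spark.read.csv("/data/huge_file.csv", header=True, inferSchema=True)

df.printSchema()
```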
Note that if you choose to tear down your HDInsight cluster and you are using Hive in conjunction with Spark for schema/table persistence, make sure you are using an external database to host your metastore.
Please see here for more information on external metastores: https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-provision-linux-clusters#a-nameuse-hiveoozie-metastoreahive-metastore
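For example, you can register the CSV as an external table so that only the table definition lives in the metastore while the data stays in the Data Lake. A rough sketch, with a hypothetical table name and path, and assuming the CSV has no header row:

```python
from pyspark.sql import SparkSession

# Hive support is needed so table definitions go to the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales_csv (
        customer_id STRING,
        amount DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'adl://mydatalake.azuredatalakestore.net/data/sales/'
""")

# Because the table is EXTERNAL and the metastore is hosted in an external
# database, tearing down the HDInsight cluster loses neither the data nor
# the table definition; a new cluster pointed at the same metastore and
# storage account can query it immediately.
spark.sql("SELECT COUNT(*) FROM sales_csv").show()
```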
Upvotes: 1