Reputation: 865
To load large SQL data into Spark for transformation & ML, which of the two options below is better in terms of performance?
Option 1: Use the Spark SQL JDBC connector to load the SQL data directly into Spark.
Option 2: Use Sqoop to load the SQL data into HDFS in CSV format, and then use Spark to read the data from HDFS.
Please suggest which of the above is a good approach for loading large SQL data into Spark.
Upvotes: 3
Views: 10958
Reputation: 21
I would suggest using Sqoop to ingest the data into HDFS and then using Spark for the analysis on it, based on the observations below from importing a sample 32 GB table from MySQL into HDFS. Also, as suggested by chet, you can (and should) use the Parquet file format while importing, as it considerably reduces file sizes, as seen in these observations.
MySQL database table “EMP_TEST”: the size is around 32.7 GB and the number of records is around 77.5 million.
Import into HDFS using Sqoop: the Sqoop import into HDFS ran smoothly and took around 8 minutes to complete. By default, Sqoop used “snappy” compression (as seen in the logs), and the total size of the files in HDFS was only around 320 MB. The import was roughly along the lines of the sketch below.
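For concreteness, that kind of Sqoop import could be driven from Python roughly as in the following minimal, hypothetical sketch: the host, database, credentials, HDFS target directory and mapper count are placeholders, and only the table name EMP_TEST comes from the observations above.

```python
# Minimal sketch: launching a Sqoop import from Python via subprocess.
# All connection details and paths below are hypothetical placeholders.
import subprocess

sqoop_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db-host:3306/testdb",  # hypothetical MySQL endpoint
    "--username", "sqoop_user",                       # hypothetical credentials
    "--password-file", "/user/sqoop_user/.password",  # password kept in a file, not on the CLI
    "--table", "EMP_TEST",
    "--as-parquetfile",            # write Parquet (snappy-compressed by default, per the logs above)
    "--target-dir", "/data/emp_test_parquet",         # hypothetical HDFS output directory
    "--num-mappers", "8",          # parallel map tasks; tune to the source DB and cluster
]

# Run the import and raise if Sqoop exits with a non-zero status.
subprocess.run(sqoop_cmd, check=True)
```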
Import into HDFS using Spark: when I tried to import the same table directly with Spark, the job failed miserably.
In Spark, when the DataFrame is created from the Parquet files imported by Sqoop, it runs very smoothly, along the lines of the sketch below.
(Screenshot: DataFrame created in Spark from the data imported with Sqoop.)
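Reading those Sqoop-written Parquet files back into a Spark DataFrame might look like the following minimal sketch; the HDFS path is a hypothetical placeholder matching the import sketch above.

```python
# Minimal sketch: build a DataFrame from the Parquet files Sqoop wrote to HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-sqoop-parquet").getOrCreate()

# Parquet carries its own schema, so nothing needs to be declared here.
emp_df = spark.read.parquet("hdfs:///data/emp_test_parquet")  # hypothetical path

emp_df.printSchema()
print(emp_df.count())  # around 77.5 million rows in the observations above
```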
Upvotes: 0
Reputation: 23
If you want to use Spark further for transformation & ML, you can use Spark SQL to load the data into HDFS, or you can create a Hive table directly. It will be easy to write the code within the same project. The following are my observations about performance:
1. I used a 39 GB table for the comparison, on a cluster with 300 GB of memory and 50 cores, and Sqoop and Spark performance were the same: both jobs took 12 minutes to migrate the data into a Hive table. I expect that with a much larger amount of memory and cores the difference would be at least 20-30 percent in processing speed.
2. The other advantage is that we can write validation code in the same Spark script (see the sketch below).
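As a rough illustration of point 2, a single Spark script can pull the table over JDBC, run a quick validation, and write straight into a Hive table. Everything in the sketch below (connection URL, credentials, table names) is a hypothetical placeholder rather than the exact job used in the comparison.

```python
# Minimal sketch: JDBC read, validation and Hive write in one Spark script.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jdbc-to-hive")
    .enableHiveSupport()  # needed so saveAsTable goes through the Hive metastore
    .getOrCreate()
)

# Read the source table over JDBC (hypothetical endpoint, table and credentials).
src_df = spark.read.jdbc(
    url="jdbc:mysql://db-host:3306/testdb",
    table="source_table",
    properties={"user": "spark_user", "password": "secret"},
)

# Validation step living in the same script, as mentioned above.
assert src_df.count() > 0, "source table appears to be empty"

# Persist the result as a Hive table for downstream transformation & ML.
src_df.write.mode("overwrite").saveAsTable("analytics.source_table")
```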
Upvotes: 0
Reputation: 22105
I've never used Sqoop, but the answer probably depends on your use case. For just a single job where you want to query some relational SQL data from Spark, you should just use the built-in JDBC connector. That's the whole point of an analytics database: it's a way to store a large number of records with a uniform structure in such a way that they can be queried quickly and accurately.
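A minimal sketch of that built-in JDBC connector in PySpark could look like the following. The connection URL, table and partition column are hypothetical placeholders; the partitioning options just split the read into parallel queries instead of one large scan over a single connection.

```python
# Minimal sketch: reading relational data straight into Spark over JDBC.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/sales")  # hypothetical database
    .option("dbtable", "orders")                       # hypothetical table
    .option("user", "spark_user")
    .option("password", "secret")
    .option("partitionColumn", "order_id")             # assumes a numeric, indexed column
    .option("lowerBound", 1)
    .option("upperBound", 10_000_000)
    .option("numPartitions", 16)                       # 16 parallel queries against the DB
    .option("fetchsize", 10_000)                       # rows pulled per round trip
    .load()
)

df.createOrReplaceTempView("orders")
spark.sql("SELECT COUNT(*) FROM orders").show()
```

Picking a partition column the database has indexed keeps each of those parallel range queries cheap on the source side.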
If you had network limitations between your SQL database and your Spark cluster, were running a lot of jobs off the resulting dataset, and were trying to minimize requests to your database, then it might make sense to transfer the data first. One practical example that might merit building a copy task (which sounds like it doesn't apply in your case) would be if your database and cluster are behind separate firewalls.
One other note: should you decide you need to copy your data into a file first, you should probably look at alternatives to CSV. Look into some of the benefits that a format like Parquet can offer, especially if you're looking to transfer, store, and query an extremely large column-oriented dataset.
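Continuing from the `spark` session and `df` DataFrame in the sketch above (paths and column names are again hypothetical), writing the intermediate copy as Parquet instead of CSV might look like this:

```python
# Minimal sketch: stage the copy as snappy-compressed Parquet rather than CSV.
(
    df.write
    .mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///staging/orders_parquet")  # hypothetical staging path
)

# Later jobs can read back only the columns they need; Parquet's columnar
# layout makes that pruning cheap, which CSV cannot do.
subset = spark.read.parquet("hdfs:///staging/orders_parquet").select("order_id", "amount")
subset.show(5)
```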
Upvotes: 6