Reputation: 1345
I have four self-contained *.sql dumps (about 20 GB each) that I need to convert to datasets in Apache Spark.
I have tried installing a local database (InnoDB), importing the dump, and reading from there, but that seems too slow (I spent around 10 hours on it).
I also read the file directly into Spark using:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val spark = SparkSession.builder().appName("sparkSession").getOrCreate()
import spark.implicits._ // needed for .toDF on the RDD below

val myQueryFile = spark.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This will also show you the structure of the SQL dump for your use case.
val myQueryFileDF = myQueryFile.toDF
  .withColumn("index", monotonically_increasing_id())
  .withColumnRenamed("value", "text")

// Identify all tables and data in the SQL dump along with their indexes.
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))
println("If there is a count mismatch between these values, choose a different substring: " +
  tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
The problem is that the dump contains multiple tables, each of which needs to become a dataset, so first I need to understand whether it can even be done for one table. Is there any .sql parser written for Scala Spark?
Is there a faster way of going about this? Can I read the self-contained .sql file directly into Hive?
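For what it's worth, here is a minimal sketch (continuing the code above) of how one table's lines could be sliced out using the index column; the bounds and the table name are made-up placeholders, and multi-line statements still need real parsing:

// Hypothetical index bounds for one table, read off by inspecting
// tableStructures / tableStructureEnds (the values here are made up).
val start = 120L // index of this table's CREATE TABLE line
val end = 450L   // index of the matching ") ENGINE" line

// All dump lines belonging to that one table, in order.
val oneTable = myQueryFileDF.filter(col("index") >= start && col("index") <= end)

// Its data rows only (hypothetical table name).
val oneTableData = oneTable.filter(col("text").contains("INSERT INTO `some_table`"))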
UPDATE 1: I am writing a parser for this, based on the input given by Ajay.
UPDATE 2: Changed everything to Dataset-based code in order to use the SQL parser, as suggested.
Upvotes: 4
Views: 3788
Reputation: 1046
I second @Ajay Srivastava. I solved this problem by parsing the dump myself, as he suggests. I wrote a blog post about it; complete details can be found there.
Upvotes: 0
Reputation: 74679
Is there any .sql parser written for Scala Spark?
Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?
The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?
Please note that ParserInterface deals with a single SQL statement at a time, so you'd have to somehow parse the entire dumps yourself and find the table definitions and rows.
The ParserInterface is available as the sqlParser of a SessionState.
scala> :type spark
org.apache.spark.sql.SparkSession
scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface
Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr, or simply the expr standard function. You may also use the SQL parser directly.
(Shameless plug) You may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.
Upvotes: 4
Reputation: 1181
You need to parse it by yourself. It requires the following steps: read the dump as text, locate each table's CREATE TABLE and INSERT INTO statements, derive the schema from the former, and turn the latter's VALUES into rows.
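For illustration, a rough sketch of such a parser for one table, assuming mysqldump-style single-line statements of the form INSERT INTO `t` VALUES (...),(...); (the table name and path are placeholders, and the naive splitting ignores quoting and escapes, which a real parser must handle):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dumpParser").getOrCreate()
import spark.implicits._

val table = "some_table" // hypothetical; take real names from the CREATE TABLE lines
val lines = spark.sparkContext.textFile("C:/Users/some_db.sql")

// Keep only this table's INSERT statements (assumes one statement per line).
val inserts = lines.filter(_.startsWith(s"INSERT INTO `$table` VALUES"))

// Strip the statement wrapper, leaving "(v1,v2,...),(v1,v2,...)".
val tupleBlocks = inserts.map(_.stripPrefix(s"INSERT INTO `$table` VALUES ").stripSuffix(";"))

// One record per row tuple. NOTE: "),(" inside quoted string values breaks this;
// real dumps need a quote-aware splitter.
val rows = tupleBlocks.flatMap(_.stripPrefix("(").stripSuffix(")").split("\\),\\("))

// Each element is the raw "v1,v2,..." text of one row; turning it into typed
// columns needs quote-aware splitting plus the schema from CREATE TABLE.
val tableDF = rows.toDF("row")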
Upvotes: 2