Leothorn

Reputation: 1345

How to extract tables with data from .sql dumps using Spark?

I have four self-contained *.sql dumps (about 20 GB each) that I need to convert to datasets in Apache Spark.

I have tried installing a local database (InnoDB) and importing the dump, but that is too slow (I spent around 10 hours on it).

Instead, I read the file directly into Spark:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

val sparkSession = SparkSession.builder().appName("sparkSession").getOrCreate()
import sparkSession.implicits._   // needed for .toDF below

val myQueryFile = sparkSession.sparkContext.textFile("C:/Users/some_db.sql")

// Convert this to an indexed DataFrame so you can parse multi-line CREATE / INSERT statements.
// This also shows you the structure of the sql dump for your use case.
val myQueryFileDF = myQueryFile.toDF()
  .withColumn("index", monotonically_increasing_id())
  .withColumnRenamed("value", "text")

// Identify all tables and data in the sql dump along with their indexes
val tableStructures = myQueryFileDF.filter(col("text").contains("CREATE TABLE"))
val tableStructureEnds = myQueryFileDF.filter(col("text").contains(") ENGINE"))

println("If there is a count mismatch between these values choose a different substring: " +
  tableStructures.count() + " " + tableStructureEnds.count())

val tableData = myQueryFileDF.filter(col("text").contains("INSERT INTO "))
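
The idea is then to pair these start/end indexes to slice out one table at a time, roughly like this (just a sketch; it assumes the counts above match):

// Sketch: use the index column to pull out the CREATE TABLE block of the first table.
// Assumes tableStructures and tableStructureEnds have matching counts (checked above).
val starts = tableStructures.select("index").as[Long].collect()
val ends   = tableStructureEnds.select("index").as[Long].collect()

val firstTableDDL = myQueryFileDF
  .filter(col("index") >= starts(0) && col("index") <= ends(0))
  .orderBy("index")
  .select("text")

firstTableDDL.show(false)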

The problem is that the dump contains multiple tables, each of which needs to become a dataset. To get there, I need to understand whether it can be done for even one table. Is there any .sql parser written for Scala Spark?

Is there a faster way of going about it? Can I read the self-contained .sql file directly into Hive?

UPDATE 1: I am writing the parser for this based on the input given by Ajay.

UPDATE 2: Changing everything to Dataset-based code to use the SQL parser as suggested.

Upvotes: 4

Views: 3788

Answers (3)

Yawar

Reputation: 1046

I second @Ajay Srivastava. I solved this problem by:

  1. Using bash to split the dump file into multiple compressed files.
  2. Reading those files through sc.binaryFiles and converting them to CSV (sketched below).
  3. Converting that CSV to ORC.

I wrote a blog post about it; the complete details can be found there.
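
As a rough illustration of steps 2 and 3 (not the blog's code, and using sc.textFile on gzipped parts instead of sc.binaryFiles; the table name and paths are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dumpToOrc").getOrCreate()

// Read the split, compressed chunks; .gz parts are decompressed transparently.
val lines = spark.sparkContext.textFile("dump_parts/*.gz")

// Keep only the INSERT statements of one (hypothetical) table.
val inserts = lines.filter(_.startsWith("INSERT INTO `users`"))

// Naive extraction of the value tuples as CSV rows; quoting and escaped
// commas inside values are not handled here.
val csvRows = inserts.flatMap { l =>
  val body = l.substring(l.indexOf("VALUES (") + 8, l.lastIndexOf(")"))
  body.split("\\),\\(")
}
csvRows.saveAsTextFile("users_csv")

// Read the CSV back with an inferred schema and persist as ORC.
val usersDF = spark.read.option("inferSchema", "true").csv("users_csv")
usersDF.write.orc("users_orc")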

Upvotes: 0

Jacek Laskowski

Reputation: 74679

Is there any .sql parser written for Scala Spark?

Yes, there is one and you seem to be using it already. That's Spark SQL itself! Surprised?

The SQL parser interface (ParserInterface) can create relational entities from the textual representation of a SQL statement. That's almost your case, isn't it?

Please note that ParserInterface deals with a single SQL statement at a time so you'd have to somehow parse the entire dumps and find the table definitions and rows.

The ParserInterface is available as sqlParser of a SessionState.

scala> :type spark
org.apache.spark.sql.SparkSession

scala> :type spark.sessionState.sqlParser
org.apache.spark.sql.catalyst.parser.ParserInterface

Spark SQL comes with several methods that offer an entry point to the interface, e.g. SparkSession.sql, Dataset.selectExpr or simply the expr standard function. You may also use the SQL parser directly.
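
For example, parsePlan gives you the logical plan of a single statement (a quick sketch; the DDL text here is made up):

// Parse one statement's text into a Catalyst logical plan.
val parser = spark.sessionState.sqlParser
val plan = parser.parsePlan("CREATE TABLE users (id INT, name STRING) USING parquet")
println(plan.numberedTreeString)

You can then pattern match on the resulting plan to pull out the table definition.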


Shameless plug: you may want to read about ParserInterface — SQL Parser Contract in the Mastering Spark SQL book.

Upvotes: 4

Ajay Srivastava

Reputation: 1181

You need to parse it yourself. It requires the following steps (a sketch follows the list):

  1. Create a class for each table.
  2. Load the files using textFile.
  3. Filter out all statements other than insert statements.
  4. Split the RDD into multiple RDDs using filter, based on the table name in the insert statement.
  5. For each RDD, use map to parse the values in the insert statement and create an object.
  6. Convert the RDDs to datasets.
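
A minimal sketch of these steps for a single hypothetical table users(id, name); the class, regex and path are illustrative only:

import org.apache.spark.sql.SparkSession

case class User(id: Int, name: String)                              // step 1

val spark = SparkSession.builder().appName("sqlDumpParser").getOrCreate()
import spark.implicits._

val lines = spark.sparkContext.textFile("C:/Users/some_db.sql")     // step 2

// Steps 3 and 4: keep only the insert statements of this table.
val userInserts = lines.filter(_.startsWith("INSERT INTO `users`"))

// Step 5: parse the value tuples into objects; quoting/escaping is not handled here.
val rowPattern = """\((\d+),'([^']*)'\)""".r
val users = userInserts.flatMap { line =>
  rowPattern.findAllMatchIn(line).map(m => User(m.group(1).toInt, m.group(2)))
}

val usersDS = users.toDS()                                           // step 6
usersDS.show()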

Upvotes: 2
