How to connect to Apache spark/hadoop from VBA

Question

I am familiar with SQL (especially postgres) and VBA, but on the Apache spark side, I am a total newbie, but it seems to run and return queries results much faster than using SQL?

As of now in my daily work, I connect my Excel VBA with Postgresql through OLEDB (others ppl use ODBC and etc.), so whenever I need to retrieve something from DB, I can easily do so by set a connection and write a SQL string in VBA, and then dump the output in the desired sheets and cells. But the drawback there is the speed, as my data grows larger and larger, when I need to run complicated SQL queries for complicated calculations or relations, it took a long long wait to get the results.

So other than upgrade our server which host the DB, I heard Spark/Hadoop is the solution for speeding up the task?

Typically when I need to do the VBA-postgres interaction, I do something like the following:

Public Sub refresh_cf()
    Dim dataConn As New ADODB.Connection
    Dim strSQL As String
    Dim strCON As String
    Dim strCmd As New ADODB.Command
    Dim loadTable As QueryTable

    strCON = "Server=server IP;" & _
             "DSN=PostgreSQL35W;" & _
             "UID=USERNAME;" & _
             "PWD=PASSWORD;" & _
             "Database=DBNAME;" & _
             "Port=5432;" & _
             "CommandTimeout=12"
    dataConn.ConnectionString = strCON
    dataConn.Open

    strSQL = "SELECT * FROM TABLE WHERE...."
    strCmd.ActiveConnection = dataConn
    strCmd.CommandType = adCmdText
    strCmd.CommandText = strSQL
    strCmd.CommandTimeout = 0

    Set loadTable = Sheet2.QueryTables.Add(Connection:=strCmd.Execute, 
    Destination:=Sheet2.Range("A4"))
    With loadTable
        .BackgroundQuery = False
        .AdjustColumnWidth = False
        .refresh
    End With       
End Sub

I am wondering if we can achieve the same to connect with Apache Spark or Hadoop and return query results by this way??

Say each of the databases we are processing is fairly huge, billions of rows (and countless cells per row), and if I perform some complex relational calculations within this DB, currently the postgres on our server may take hours if not days to complete the task (even the returned results might not be that big, i.e. does not exceed the 1.6 million rows limit of per excel sheet), is it worthy to utilize Hadoop or Spark via VBA if possible?

I know we can do this in Python for sure, something like:

Python

from pyspark.sql import HiveContext, Row
#or
from pyspark.sql import SQLContext, Row
sc = SparkContext(...)
hiveCtx = HiveContext(sc)
#Then we can run query
hiveCtx.sql("""SELECT * FROM TABLE WHERE....""")

Moreover, I find a link which introduced the ODBC connection to Hadoop, but can someone share your way if you are doing so, and provide some basic example code to clarify the process? Thanks.

How to connect to Apache spark/hadoop from VBA

Answers (1)

Related Questions