Reputation: 129
I am familiar with SQL (especially postgres) and VBA, but on the Apache spark side, I am a total newbie, but it seems to run and return queries results much faster than using SQL?
As of now in my daily work, I connect my Excel VBA with Postgresql through OLEDB
(others ppl use ODBC
and etc.), so whenever I need to retrieve something from DB
, I can easily do so by set a connection and write a SQL string in VBA, and then dump the output in the desired sheets and cells. But the drawback there is the speed, as my data grows larger and larger, when I need to run complicated SQL queries
for complicated calculations or relations, it took a long long wait to get the results.
So other than upgrade our server which host the DB
, I heard Spark/Hadoop
is the solution for speeding up the task?
Typically when I need to do the VBA-postgres interaction, I do something like the following:
Public Sub refresh_cf()
Dim dataConn As New ADODB.Connection
Dim strSQL As String
Dim strCON As String
Dim strCmd As New ADODB.Command
Dim loadTable As QueryTable
strCON = "Server=server IP;" & _
"DSN=PostgreSQL35W;" & _
"UID=USERNAME;" & _
"PWD=PASSWORD;" & _
"Database=DBNAME;" & _
"Port=5432;" & _
"CommandTimeout=12"
dataConn.ConnectionString = strCON
dataConn.Open
strSQL = "SELECT * FROM TABLE WHERE...."
strCmd.ActiveConnection = dataConn
strCmd.CommandType = adCmdText
strCmd.CommandText = strSQL
strCmd.CommandTimeout = 0
Set loadTable = Sheet2.QueryTables.Add(Connection:=strCmd.Execute,
Destination:=Sheet2.Range("A4"))
With loadTable
.BackgroundQuery = False
.AdjustColumnWidth = False
.refresh
End With
End Sub
I am wondering if we can achieve the same to connect with Apache Spark or Hadoop and return query results by this way??
Say each of the databases we are processing is fairly huge, billions of rows (and countless cells per row), and if I perform some complex relational calculations within this DB, currently the postgres on our server may take hours if not days to complete the task (even the returned results might not be that big, i.e. does not exceed the 1.6 million rows limit of per excel sheet), is it worthy to utilize Hadoop or Spark via VBA if possible?
I know we can do this in Python
for sure, something like:
Python
from pyspark.sql import HiveContext, Row
#or
from pyspark.sql import SQLContext, Row
sc = SparkContext(...)
hiveCtx = HiveContext(sc)
#Then we can run query
hiveCtx.sql("""SELECT * FROM TABLE WHERE....""")
Moreover, I find a link which introduced the ODBC connection to Hadoop, but can someone share your way if you are doing so, and provide some basic example code to clarify the process? Thanks.
Upvotes: 1
Views: 1967
Reputation: 20352
I don't have Hadoop installed on my personal laptop, so I can't test this process, but I think it should be essentially like this.
You can see all details here.
Upvotes: 1