user6440081

Import huge dataset into Access from Excel via VBA

I have a huge dataset which I need to import from Excel into Access (~800k lines). However, I can ignore lines with a particular column value, which make up like 90% of the actual dataset. So in fact, I only really need like 10% of the lines imported.

In the past I've been importing Excel files line by line in the following manner (pseudocode):

For i = 1 To EOF
    sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
    xlSheet.Cells(i, 1).Value & " ', '" & _
    xlSheet.Cells(i, 2).Value & "');"       
Next i
DoCmd.RunSQL sql

With ~800k lines this takes way too long, since a separate query is created and run for every single line.

Considering that I can also ignore 90% of the lines, what is the fastest approach to import the dataset from Excel into Access?

I was thinking of creating a temporary Excel file with a filter applied, and then importing just the filtered data.

But is there a better/faster approach than this? And what is the fastest way to import an Excel file into Access via VBA?

Thanks in advance.

Upvotes: 0

Views: 4375

Answers (3)

Dan

Reputation: 773

I'm a little late to the party, but I stumbled on this while looking for information on a similar problem. I thought I might share my solution in case it could help others or maybe the OP, if he/she is still working on it. Here's my problem and what I did:

I have an established Access database with approximately the same number of rows as the OP's (6 columns, approx. 850k rows). About once a week we receive a .xlsx file from a partner company with one sheet and the data in the same structure as the db.

This file contains the entire db, plus updates (new records and changes to old records, no deletions). The first column contains a unique identifier for each row. When we receive the file, the Access db is updated through queries similar to the ones Parfait suggested, but since it's the entire 850k+ records, the compare-and-update takes 10-15 minutes or longer, depending on what else we have going on.

Since it would be faster to load just the changes into the current Access db, I needed to produce a delta file (preferably a .txt that can be opened with Excel and saved as .xlsx if needed). I assume this is similar to what the OP was looking for. To do this I wrote a small application in C++ to compare the file from the previous week to the one from the current week. The data itself is an amalgam of character and numerical data that I will just call string1 through string6 here for simplicity. It looks like this:

Col1       Col2       Col3       Col4       Col5       Col6
string1    string2    string3    string4    string5    string6
.......
''''Through 850k rows''''

After saving both .xlsx files as tab-delimited .txt files, they look like this:

Col1\tCol2\tCol3\tCol4\tCol5\tCol6\n
string1\tstring2\tstring3\tstring4\tstring5\tstring6\n
....
//Through 850k rows//

The fun part! I took the old .txt file and stored it as a hash table (using the C++ unordered_map from the standard library). Then, with an input file stream from the new .txt file, I used Col1 in the new file as a key into the hash table and output any differences to two different files: one you can use with an append query to add the new records to the db, and the other you can use to update records that have changed.

I've heard it's possible to create a more efficient hash table than unordered_map, but at the moment this works well, so I'll stick with it. Here's my code:

#include <iostream>     
#include <fstream>      
#include <string>       
#include <iterator>
#include <unordered_map>


int main()
{
    using namespace std;

    //variables
    const string myInFile1{"OldFile.txt"};
    const string myInFile2{"NewFile.txt"};
    string mappedData;
    string key;

    //hash table objects
    unordered_map<string, string> hashMap;
    unordered_map<string, string>::iterator cursor;

    //input files
    ifstream fin1;
    ifstream fin2;
    fin1.open(myInFile1);
    fin2.open(myInFile2);

    //output files
    ofstream fout1;
    ofstream fout2;
    fout1.open("For Updated.txt");  //updating old records 
    fout2.open("For Upload.txt");   //uploading new records

    //This loop takes the original input file (i.e.; what is in the database already)
    //and hashes the entire file using the Col1 data as a key. On my system this takes
    //approximately 2 seconds for 850k+ rows with 6 columns
    while(fin1)
    {
        getline(fin1, key, '\t');          //get the first column
        getline(fin1, mappedData, '\n');   //get the other 5 columns
        if(key.empty())                    //guard against the empty read at end of file
            break;
        hashMap[key] = mappedData;         //store the data in the hash table
    }
    fin1.close();

    //output file headings
    fout1 << "COl1\t" << "COl2\t" << "COl3\t" << "COl4\t" << "COl5\t" << "COl6\n";
    fout2 << "COl1\t" << "COl2\t" << "COl3\t" << "COl4\t" << "COl5\t" << "COl6\n";

    //This loop takes the second input file and reads each line, first up to the
    //first tab delimiter and stores it as "key", then up to the new line character
    //storing it as "mappedData" and then uses the value of key to search the hash table
    //If the key is not found in the hash table, a new record is created in the upload
    //output file. If it is found, the mappedData from the file is compared to that of
    //the hash table and if different, the updated record is sent to the update output
    //file. I realize that while(fin2) is not the optimal syntax for this loop but I
    //have included a check to see if the key is empty (eof) after retrieving
    //the current line from the input file. YMMV on the time here depending on how many
    //records are added or updated (1000 records takes about another 5 seconds on my system)    
    while(fin2)
    {
        getline(fin2, key, '\t');           //get key from Col1 in the input file
        getline(fin2, mappedData, '\n');    //get the mappedData (Col2-Col6)
        if(key.empty())                     //exit the file read if key is empty
            break;
        cursor = hashMap.find(key);         //assign the iterator to the hash table at key

        if(cursor != hashMap.end())          //key already in the hash table: existing record
        {
            if(cursor->second != mappedData) //data differs: send the changed record to the update file
            {
                fout1 << key << "\t" << mappedData << "\n";
            }
        }
        else                                 //key not found: new record, send it to the upload file
        {
            fout2 << key << "\t" << mappedData << "\n";
        }
    }


    fin2.close();
    fout1.close();
    fout2.close();
    return 0;
}

There are a few things I am working on to make this an easy-to-use executable (for example, reading the XML structure inside the Excel .zip container directly, or maybe using an ODBC connection), but for now I'm just testing it to make sure the outputs are correct. Of course, the output files would then have to be loaded into the Access database using queries similar to what Parfait suggested. Also, I'm not sure if Excel or Access VBA has a library to build hash tables, but it might be worth exploring further if it saves time in accessing the Excel data. Any criticisms or suggestions are welcomed.
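
As it turns out, VBA does have a hash-table-style structure readily available in the Scripting.Dictionary object (Microsoft Scripting Runtime, or late binding as below), so a comparison like the one above could in principle be done without leaving Excel/Access. A rough sketch, assuming hypothetical oldSheet/newSheet worksheet objects and, for brevity, only two data columns:

Dim dict As Object
Dim i As Long
Dim key As String, payload As String

Set dict = CreateObject("Scripting.Dictionary")    'late-bound hash table

'Hash the old data: Col1 is the key, the remaining columns are stored as one tab-delimited string
For i = 2 To oldLastRow                            'oldLastRow: placeholder for the last used row
    key = oldSheet.Cells(i, 1).Value
    dict(key) = oldSheet.Cells(i, 2).Value & vbTab & oldSheet.Cells(i, 3).Value
Next i

'Walk the new data and split it into "changed" and "new", like the C++ version above
For i = 2 To newLastRow
    key = newSheet.Cells(i, 1).Value
    payload = newSheet.Cells(i, 2).Value & vbTab & newSheet.Cells(i, 3).Value
    If dict.Exists(key) Then
        If dict(key) <> payload Then
            'changed record: write to the update output
        End If
    Else
        'new record: write to the upload output
    End If
Next i

Reading each sheet into a Variant array first (e.g. arr = oldSheet.UsedRange.Value) rather than hitting .Cells 850k times would likely be needed for this to come anywhere near the C++ timings.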

Upvotes: 0

SunKnight0

Reputation: 3351

A slight change in your code will do the filtering for you:

Dim strTest As String
For i = 1 To EOF
    strTest = xlSheet.Cells(i, 1).Value
    If Nz(strTest) <> "" Then
        sql = "Insert Into [Table] (Column1, Column2) VALUES ('" & _
              strTest & "', '" & _
              xlSheet.Cells(i, 2).Value & "');"
        DoCmd.RunSQL sql
    End If
Next i

I assume having the RunSQL outside the loop was just a mistake in your pseudocode. This tests whether the cell in the first column is empty, but you can substitute any condition that is appropriate for your situation.
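
If the per-row inserts are still too slow, one further tweak (not part of the original approach, just a suggestion) is to run the statements through DAO's CurrentDb.Execute instead of DoCmd.RunSQL, which skips the per-statement confirmation prompts and raises a trappable error on failure. A minimal sketch, assuming xlSheet is already set and using lastRow as a placeholder for the last used row:

Dim db As DAO.Database
Dim strTest As String
Dim sql As String
Dim i As Long

Set db = CurrentDb                              'reuse one database object for the whole loop

For i = 1 To lastRow                            'lastRow: placeholder for the last used row
    strTest = xlSheet.Cells(i, 1).Value
    If strTest <> "" Then                       'skip the rows you do not need
        sql = "INSERT INTO [Table] (Column1, Column2) VALUES ('" & _
              Replace(strTest, "'", "''") & "', '" & _
              Replace(xlSheet.Cells(i, 2).Value, "'", "''") & "');"
        db.Execute sql, dbFailOnError           'no confirmation prompt; errors are raised instead
    End If
Next i

The Replace calls just escape any embedded single quotes so a value like O'Brien does not break the SQL string.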

Upvotes: 0

Parfait

Reputation: 107567

Consider running a dedicated Access query for the import. Add the SQL below into an Access query window, or run it as an SQL statement over a DAO/ADO connection. Include any WHERE clause you need; note that filtering on named column headers requires HDR=Yes, whereas it is currently set to HDR=No:

INSERT INTO [Table] (Column1, Column2)
SELECT *
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];

Alternatively, run a make-table query if you need a staging temp table (where you can remove the 90% of lines you don't want) before loading the final table, but do note that this query replaces the table if it already exists:

SELECT * INTO [NewTable]
FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$];
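
To run either statement from VBA instead of a saved query, something along these lines should work. This is only a sketch: 'ValueToIgnore' is a placeholder for the column value the OP wants to skip, and with HDR=No the Excel columns are exposed as F1, F2, and so on:

Dim sql As String

sql = "INSERT INTO [Table] (Column1, Column2) " & _
      "SELECT F1, F2 " & _
      "FROM [Excel 12.0 Xml;HDR=No;Database=C:\Path\To\Workbook.xlsx].[SHEET1$] " & _
      "WHERE F1 <> 'ValueToIgnore';"

CurrentDb.Execute sql, dbFailOnError    'runs the bulk insert as a single query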

Upvotes: 2
