Michal_Szulc
Michal_Szulc

Reputation: 4177

Solution to map different excel files to db

I have to map a lot of different files with different structures to a db. There is a lot of different tables in those xlsx so I thought about schemeless noSQL approach, but I'm quite newbie in this field.

It should be a microservice with client interface for choosing tables/cells for parsing xlsx files. I do not have strict technology; it could be JAVA, GROOVY, Python or even a JavaScript engine.

Do you know any working solution for doing it?

Here is example xlsx (but I've got also other files, also in xls format): http://stat.gov.pl/download/gfx/portalinformacyjny/pl/defaultaktualnosci/5502/11/13/1/wyniki_finansowe_podmiotow_gospodarczych_1-6m_2015.xlsx

Upvotes: 2

Views: 532

Answers (4)

Mike Br
Mike Br

Reputation: 918

Maybe you should try Google Sheets to display excel and Google Apps Script (https://developers.google.com/apps-script/overview) to write custom add-on for parsing data to JSON. Spreadsheet Service (https://developers.google.com/apps-script/reference/spreadsheet/) has plenty methods to access data in sheets.

Next you can send this JSON over API (https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app) or put directly into database (https://developers.google.com/apps-script/guides/jdbc).

Maybe isn't clean, but fast solution.

Upvotes: 1

Alex Proca
Alex Proca

Reputation: 487

The work you have to do is called ETL (Extract Transform Load). You need to either find a good ETL software (here is a discussion about open source ETL) or to script your own solution in a language you are used with.

The advantage of a ready made GUI software is that you just have to drag and drop data but if you have some custom logic or semi structured data like in your xlsx example, you have limited support.

The advantage of writing your own script is you have all the freedom you need.

I have done some ETL work and I used successfully Groovy for writing my own solution with custom logic and so on, and in terms of GUI I used Altova Mapforce when I had to import some exotic file types.

If you decide to write your own solution you have to:

  1. Convert all data to an easy to load format. In your case you have to convert each xls or xlsx tab to CSV with a naming convention.
  2. Load your files in your chosen language for transforming
  3. Do your logic to put data in a desirable format
  4. Save it in a database (SQL or noSQL)

Upvotes: 3

nick_n_a
nick_n_a

Reputation: 206

Javascript solution, as xlsx2csv (you can make export anywhere):

var def =  "1.xlsx";
if (WScript.Arguments.length>0) def =   WScript.Arguments(0);

var col = [];


var objShell = new ActiveXObject( "Shell.Application" );
var fs = new ActiveXObject("Scripting.FileSystemObject");



function flush(){
  WScript.Echo(col.join(';'));
  }



function import_xlsx(file) {
var strZipFile =  file; // '"1.xlsx"     'name of zip file
var outFolder = ".";    //         'destination folder of unzipped files (must exist)
var pwd =WScript.ScriptFullName.replace( WScript.ScriptName, "");
var i,j,k;
var strXlsFile = strZipFile;
var strZipFile =   strXlsFile.replace( ".xlsx",".zip").replace( ".XLSX",".zip");
fs.CopyFile (strXlsFile,strZipFile, true);
var objSource = objShell.NameSpace(pwd+strZipFile).Items();
var objTarget = objShell.NameSpace(pwd+outFolder);
 for  (i=0;i<objSource.Count;i++)
   if (objSource.item(i).Name == "xl"){
       if  (fs.FolderExists("xl")) fs.DeleteFolder("xl");
       objTarget.CopyHere(objSource.item(i), 256);
       }

var xml =  new ActiveXObject("Msxml2.DOMDocument.6.0");
xml.load("xl\\sharedStrings.xml");
var sel = xml.selectNodes("/*/*/*") ; 
var vol = [];
for(i=0;i<sel.length;i++) vol.push(sel[i].text);
xml.load  ("xl\\worksheets\\sheet1.xml");
ret = "";
var line = xml.selectNodes("/*/*/*");
var li, line2 = 0, line3=0, row;

for (li = 0; li< line.length; li++){
  if (line[li].nodeName == "row") 
     for (row=0;row<line[li].childNodes.length;row++){
         r = line[li].childNodes[row].selectSingleNode("@r").text;
         line2 = eval(r.replace(r.substring(0,1),""));
         if (line2 != line3) {
            line3 = line2;
            if (line3 != 0) {
              //flush -------------------------- line3
              flush();
              for (i=0;i<col.length;i++) col[i]="";
            }
         }
        try{
        t = line[li].childNodes[row].selectSingleNode("@t").text;
        //i = instr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", left(r,1))
        i = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ").indexOf(r.charAt(0));
        while (i > col.length)  col.push("");        
        if (t == "s"){
           t = eval(line[li].childNodes[row].firstChild.text)
           col[i] = vol[t];
        } else col[i]  = line[li].childNodes[row].firstChild.text;
        } catch(e) {};


}
flush();
}

if  (fs.FolderExists("xl")) fs.DeleteFolder("xl");
if (fs.FileExists(strZipFile)) fs.DeleteFile(strZipFile);
}


import_xlsx(def);

Upvotes: 0

dangsonbk
dangsonbk

Reputation: 103

I had a project that done work almost the same as your problem but it seem easier as I had a fixed structure of xlsx files.

For xlsx parsing, I had experiment with Python and Openpyxl and had no struggle while working with them, they are simple, fast and easy to use.

For database, I recommend using MongoDB, you can deal with documents and collections in MongoDB just as simple as working with JSON objects or a set of JSON objects. PyMongo is the best and recommended way to work with MongoDB from Python I think.

The problem is you have different files with different structures. I cannot recommend anything deeper on this without viewing your data. But you should find the general structure of them or you have to figure out the way to classify them into common sets, each set will be parsed using appropriate algorithm.

Upvotes: 1

Related Questions