Pasha

Reputation: 179

Extract and sort data from .mdb file using mdbtools in Python

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I have only been able to take an .mdb file and dump all of its tables into .csv files. This creates a huge mess, since there are lots of files that need to be processed.

What I need is to extract particular sorted data from a particular table, for example the table called "Voltage". The table consists of numerous cycles, and each cycle has several rows. The cycles usually go in chronological order, but in some cases a time stamp gets recorded with a delay, so one cycle's first row can have a later time than another row of the same cycle. For each of the first or last five cycles, I need to extract the row with the latest time. For example, in the table below I would need the second row, since 101.34 is the latest time in cycle 1.

Cycle#    Time        Data
  1      100.59        34
  1      101.34        54
  1      98.78         45  
  2      
  2
  2   ........... 

Here is the script I use. I invoke it with python extract.py table_files.mdb, but I would like the script to be invoked with just ./extract.py, with the path to the files hard-coded in the script itself.

import sys, subprocess, os

DATABASE = sys.argv[1]

subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
                               stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()

print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ","_") + ".csv"
        with open(filename, 'w') as f:
            print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            f.write(contents)
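To address the "invoke with just ./extract.py" part, a minimal sketch of one way to do it: hard-code the folder, glob for .mdb files, and loop. MDB_DIR and the helper names here are assumptions, not anything from the original script; adjust the path to wherever the 2000 files live.

```python
#!/usr/bin/env python
# Sketch: hard-code the folder of .mdb files so the script can be run
# as just ./extract.py. MDB_DIR is an assumed path -- change it.
import glob
import os
import subprocess

MDB_DIR = "/data/mdb_files"  # assumption: adjust to the real folder

def csv_name(table):
    # Convert " " in table names to "_" for the CSV filename
    return table.replace(" ", "_") + ".csv"

def export_tables(database):
    # List the tables with "mdb-tables", then dump each with "mdb-export"
    tables = subprocess.run(["mdb-tables", "-1", database],
                            capture_output=True, text=True).stdout.splitlines()
    for table in tables:
        if table:
            with open(csv_name(table), "w") as out:
                out.write(subprocess.run(["mdb-export", database, table],
                                         capture_output=True, text=True).stdout)

# Process every .mdb file found under MDB_DIR, e.g.:
# for database in sorted(glob.glob(os.path.join(MDB_DIR, "*.mdb"))):
#     export_tables(database)
```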

Upvotes: 4

Views: 16255

Answers (3)

Rein

Reputation: 115

There's an alternative to mdbtools for Python: JayDeBeApi with the UCanAccess driver. It uses a Python-to-Java bridge, which slows things down, but I've been using it with considerable success, and it comes with decent error handling.

It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.

Upvotes: 0

kmader

Reputation: 1369

To bring the data directly into Pandas under Python 3, I wrote this little snippet:

import sys, subprocess
from io import StringIO
import pandas as pd

VERBOSE = True

def mdb_to_pandas(database_path):
    subprocess.call(["mdb-schema", database_path, "mysql"])
    # Get the list of table names with "mdb-tables"
    table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
                                   stdout=subprocess.PIPE).communicate()[0]
    tables = table_names.splitlines()
    sys.stdout.flush()
    # Dump each table with "mdb-export" and parse the CSV text with pandas
    out_tables = {}
    for rtable in tables:
        table = rtable.decode()  # mdb-tables output is bytes under Python 3
        if table:
            if VERBOSE:
                print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", database_path, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            out_tables[table] = pd.read_csv(StringIO(contents.decode()))
    return out_tables
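Once the tables are in pandas, the "latest row per cycle" selection from the question falls out of a groupby. A small sketch with made-up sample values mirroring the question's Voltage table:

```python
import pandas as pd

# Hypothetical sample shaped like the question's "Voltage" table
df = pd.DataFrame({"Cycle": [1, 1, 1, 2, 2],
                   "Time": [100.59, 101.34, 98.78, 103.10, 104.77],
                   "Data": [34, 54, 45, 21, 38]})

# Keep the row with the largest Time within each cycle
latest = df.loc[df.groupby("Cycle")["Time"].idxmax()]
```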

Upvotes: 2

Gord Thompson

Reputation: 123839

Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.

A few suggestions:

  1. Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.

  2. When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.

  3. At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
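Suggestions 1-3 can be sketched with Python's built-in sqlite3 module (table layout and sample values are assumptions based on the question; SQLite has no true DECIMAL type, so Time is stored as TEXT here to avoid float rounding):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
# One big table with a SourceFile column, and [Cycle#] renamed to Cycle.
# Time is TEXT because SQLite lacks a DECIMAL type.
conn.execute("""CREATE TABLE Voltage (
    SourceFile TEXT, Cycle INTEGER, Time TEXT, Data INTEGER)""")

def import_csv(conn, source_file, csv_text):
    # Load one exported CSV, tagging each row with its source filename
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        conn.execute("INSERT INTO Voltage VALUES (?, ?, ?, ?)",
                     (source_file, int(row["Cycle#"]), row["Time"], int(row["Data"])))

# Hypothetical export from one of the ~2000 files
sample_csv = "Cycle#,Time,Data\n1,100.59,34\n1,101.34,54\n1,98.78,45\n"
import_csv(conn, "file0001.csv", sample_csv)
rows = conn.execute("SELECT COUNT(*) FROM Voltage").fetchone()[0]
```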

Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:

SELECT
    v1.SourceFile, 
    v1.Cycle,
    v1.Time, 
    v1.Data 
FROM 
    Voltage v1 
    INNER JOIN 
    (
        SELECT
            SourceFile, 
            Cycle, 
            MAX([Time]) AS MaxTime 
        FROM Voltage 
        GROUP BY SourceFile, Cycle
    ) v2 
        ON v1.SourceFile=v2.SourceFile 
           AND v1.Cycle=v2.Cycle 
           AND v1.Time=v2.MaxTime
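As a quick sanity check, the query above can be run against an in-memory SQLite database (the sample rows are taken from the question; the SourceFile values are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Voltage (SourceFile TEXT, Cycle INTEGER, Time REAL, Data INTEGER)")
conn.executemany("INSERT INTO Voltage VALUES (?, ?, ?, ?)",
                 [("file0001.csv", 1, 100.59, 34),
                  ("file0001.csv", 1, 101.34, 54),
                  ("file0001.csv", 1, 98.78, 45)])

# The answer's query: join each row back to its group's MAX([Time])
query = """
SELECT v1.SourceFile, v1.Cycle, v1.Time, v1.Data
FROM Voltage v1
INNER JOIN (SELECT SourceFile, Cycle, MAX([Time]) AS MaxTime
            FROM Voltage GROUP BY SourceFile, Cycle) v2
  ON v1.SourceFile = v2.SourceFile
 AND v1.Cycle = v2.Cycle
 AND v1.Time = v2.MaxTime
"""
latest = conn.execute(query).fetchall()
```

For the sample data this returns the single row with Time 101.34, i.e. the "second row" the question asks for.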

Upvotes: 5
