Reputation: 34408
I am building some Postgres tables from Python dictionaries where the {'key': 'value'} pairs correspond to column 'key' and field 'value'. These are generated from .dbf files -- I now pipe the contents of the .dbf files into a script that returns a list of dicts like:
{'Warngentyp': '', 'Lon': '-81.67170', 'Zwatch_war': '0', 'State':...
Currently I am putting these into an SQLite database with no type declarations, dumping that to a .sql file, manually editing the schema, and importing it into Postgres.
I would love to be able to infer the correct type declarations, basically iterate over a list of strings like ['0', '3', '5'] or ['ga', 'ca', 'tn'] or ['-81.009', '135.444', '-80.000'] and generate something like 'int', 'varchar(2)', 'float'. (I would be equally happy with a Python, Postgres, or SQLite tool.)
Is there a package that does this, or a straightforward way to implement it?
Upvotes: 1
Views: 1430
Reputation: 82992
YOU DON'T NEED TO INFER THE TYPE DECLARATIONS!!!
You can derive what you want directly from the .dbf files. Each column has a name, a type code (C=Character, N=Number, D=Date (yyyymmdd), L=Logical (T/F), plus more types if the files are from Foxpro), a length (where relevant), and a number of decimal places (for type N).
Whatever software you used to dig the data out of the .dbf files needed that information to convert each piece of data to the appropriate Python data type.
Dictionaries? Why? With a minor amount of work, that software could be modified to produce a CREATE TABLE statement based on those column definitions, plus an INSERT statement for each row of data.
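For instance, here is a rough sketch of that idea -- the field descriptors at the bottom are made-up examples, and the mapping covers only the basic C/N/D/L type codes:

def dbf_field_to_pg(name, type_code, length, decimals):
    """Map one DBF field descriptor to a Postgres column definition."""
    if type_code == 'C':                      # Character
        return '%s varchar(%d)' % (name, length)
    if type_code == 'N':                      # Number
        if decimals:
            return '%s numeric(%d, %d)' % (name, length, decimals)
        return '%s integer' % name
    if type_code == 'D':                      # Date (yyyymmdd)
        return '%s date' % name
    if type_code == 'L':                      # Logical (T/F)
        return '%s boolean' % name
    return '%s varchar(%d)' % (name, length)  # fall back to text

def make_create_table(table_name, fields):
    cols = ',\n    '.join(dbf_field_to_pg(*f) for f in fields)
    return 'CREATE TABLE %s (\n    %s\n);' % (table_name, cols)

# Hypothetical descriptors, as they might come out of a DBF header:
fields = [('state', 'C', 2, 0), ('population', 'N', 12, 0), ('lon', 'N', 11, 5)]
print(make_create_table('places', fields))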
I presume you are using one of the several published Python DBF-reading modules. Any one of them should have the facilities you need: open a .dbf file, get the column names, get the column type and other info, get each row of data. If you are unhappy with the module that you are using, talk to me; I have an unpublished one that, as far as reading DBFs goes, combines the better features of the others, avoids the worst features, is as fast as you'll get with a pure-Python implementation, handles all the Visual FoxPro data types and the _NullFlags pseudo-column, handles memos, and so on.
HTH
========= Addendum: When I said you didn't need to infer types, you hadn't made it plain that you had a bunch of fields of type C which contained numbers.
FIPS fields: some come with leading zeroes and some without. If you are going to use them, you face the '012' != '12' != 12 problem. I'd suggest stripping off the leading zeroes and keeping them in integer columns, restoring the leading zeroes in reports or wherever you really need them. Why are there two each of state FIPS and county FIPS?
Population: in the sample file, almost all are integer. Four are like 40552.0000, and a reasonable number are blank/empty. You seem to regard population as important, and asked "Is it possible that some small percentage of population fields contain .... ?" Anything is possible in data. Don't wonder and speculate, investigate! I'd strongly advise you to sort your data in population order and eyeball it; you'll find that multiple places in the same state share the same population count. E.g. There are 35 places in New York state whose pop'n is stated as 8,008,278; they are spread over 6 counties. 29 of them have a PL_FIPS value of 51000; 5 have 5100 -- looks like a trailing zero problem :-(
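If eyeballing a sorted dump is tedious, here is a quick sketch of the same check -- counting how often the same population value repeats -- assuming the rows are the list of dicts described in the question and the field is called 'Population' (adjust the key to match your file):

from collections import Counter

def repeated_populations(rows, key='Population'):
    values = [r.get(key, '').strip() for r in rows]
    counts = Counter(v for v in values if v)
    # population values shared by many rows are worth a closer look
    return [(v, n) for v, n in counts.most_common(20) if n > 1]

# for value, n in repeated_populations(dbf_list):
#     print(value, 'appears', n, 'times')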
Tip for deciding between float and int: try anum = float(chars) first; if that succeeds, check if int(anum) == anum.
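As a sketch of that rule (the helper name is just an example):

def classify_number(chars):
    try:
        anum = float(chars)
        is_int = (int(anum) == anum)
    except (ValueError, TypeError, OverflowError):
        return None               # not a usable number at all
    return 'int' if is_int else 'float'

# classify_number('40552.0000') -> 'int'
# classify_number('-81.009')    -> 'float'
# classify_number('2,445 Est.') -> None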
ID: wonderful "unique ID"; 59 cases where it's not an int -- several in Canada (the website said "US cities"; is this an artifact of some unresolved border dispute?), some containing the word 'Number', and some empty.
Low-hanging fruit: I would have thought that deducing that population was in fact integer was 0.1 inches above the ground :-)
There's a serious flaw in that if all([int(value) ... logic:
>>> all([int(value) for value in "0 1 2 3 4 5 6 7 8 9".split()])
False
>>> all([int(value) for value in "1 2 3 4 5 6 7 8 9".split()])
True
>>>
You evidently think that you are testing that all the strings can be converted to int, but you're adding the rider "and are all non-zero". Ditto float a few lines later.
In other words, if there's just one zero value, you declare that the column is not integer. Even after fixing that, if there's just one empty value, you call it varchar. What I suggest is: count how many are empty (after normalising whitespace, which should include NBSP), how many qualify as integer, how many non-integer non-empty ones qualify as float, and how many are "other". Check the "other" ones; decide whether to reject or fix; repeat until happy :-)
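A rough sketch of that counting approach, assuming each column's sample is just a list of strings:

def classify_column(values):
    empty = ints = floats = 0
    other = []
    for v in values:
        s = v.replace('\xa0', ' ').strip()   # normalise whitespace, incl. NBSP
        if not s:
            empty += 1
            continue
        try:
            int(s)
            ints += 1
            continue
        except ValueError:
            pass
        try:
            float(s)
            floats += 1
        except ValueError:
            other.append(s)
    return {'empty': empty, 'int': ints, 'float': floats, 'other': other}

# >>> classify_column(['0', '1', '', '40552.0000', '2,445 Est.'])
# {'empty': 1, 'int': 2, 'float': 1, 'other': ['2,445 Est.']}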
I hope some of this helps.
Upvotes: 2
Reputation: 34408
Thanks for the help. This is a little long for an update, but here is how I combined the answers. I am starting with a list of dicts like this, generated from a .dbf file:
dbf_list = [{'Warngentyp': '', 'Lon': '-81.67170', 'Zwatch_war': '0', 'State':...
Then a function that returns up to 1000 values per column to test for the best db type declaration, giving something like {'column_name': ['list', 'of', 'sample', 'values'], 'col2': ['1', '2', '3', '4'...
def sample_fields(dicts_, number=1000):  # dicts_ would be dbf_list from above
    sample = dict([[item, []] for item in dicts_[0]])  # column names from the first row
    for dict_ in dicts_[:number]:
        for col_ in dict_:
            sample[col_].append(dict_[col_])
    return sample
Then you combine the approaches from Unknown and jacob: varchar is a good default, floats and ints are basically enough for everything else, and all() is clear and fast:
def find_typedefs(sample_dict):  # arg is output of previous function
    defs_ = {}
    for key in sample_dict:
        defs_[key] = 'varchar(255)'
        try:
            if all([int(value) for value in sample_dict[key]]):
                defs_[key] = 'int'
        except:
            try:
                if all([float(value) for value in sample_dict[key]]):
                    defs_[key] = 'float'
            except:
                continue
    return defs_
Then format the returned dict into a CREATE TABLE statement, iterate over the values in the original big list, and feed them into the database. It works great, and I am now skipping the intermediate SQLite step. Thanks again.
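Roughly, those last two steps look like this (a sketch only -- the table name and connection string are placeholders, and psycopg2 is just one way to do the actual loading; empty strings become NULL so they don't choke the int/float columns):

import psycopg2

def make_create_table(table, defs_):
    cols = ', '.join('%s %s' % (col, typ) for col, typ in defs_.items())
    return 'CREATE TABLE %s (%s);' % (table, cols)

def load_rows(conn, table, defs_, rows):
    cols = list(defs_)
    placeholders = ', '.join(['%s'] * len(cols))
    sql = 'INSERT INTO %s (%s) VALUES (%s)' % (table, ', '.join(cols), placeholders)
    with conn.cursor() as cur:
        for row in rows:
            cur.execute(sql, [row.get(c) or None for c in cols])
    conn.commit()

# conn = psycopg2.connect('dbname=gis')            # placeholder connection
# defs_ = find_typedefs(sample_fields(dbf_list))
# with conn.cursor() as cur:
#     cur.execute(make_create_table('places', defs_))
# load_rows(conn, 'places', defs_, dbf_list)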
Update for John Machin: I am using the shp2pgsql tool distributed with PostGIS. It creates a schema like the one below from a source like this one:
Column | Type |
------------+-----------------------+-
gid | integer |
st_fips | character varying(7) |
sfips | character varying(5) |
county_fip | character varying(12) |
cfips | character varying(6) |
pl_fips | character varying(7) |
id | character varying(7) |
elevation | character varying(11) |
pop_1990 | integer |
population | character varying(12) |
name | character varying(32) |
st | character varying(12) |
state | character varying(16) |
warngenlev | character varying(13) |
warngentyp | character varying(13) |
watch_warn | character varying(14) |
zwatch_war | bigint |
prog_disc | bigint |
zprog_disc | bigint |
comboflag | bigint |
land_water | character varying(13) |
recnum | integer |
lon | numeric |
lat | numeric |
the_geom | geometry |
There is stuff there that has to be wrong -- FIPS is the Federal Information Processing Standards code, and it should be an integer between 0 and something like 100,000; the same goes for population, elevation, etc. Maybe I have more of a Postgres-specific question: I wouldn't mind losing a small amount of data, or pushing it into an errors table or something, while trying to change the type on, say, the population field. How strict is the dbf type checking? For example, I see that population per shp2pgsql is varchar(12). Is it possible that some small percentage of population fields contain something like '2,445 Est.'? If I take the approach I set out in this question, with the first thousand records, I get a schema like this:
Column | Type |
------------+------------------------+-
warngentyp | character varying(255) |
lon | double precision |
zwatch_war | character varying(255) |
state | character varying(255) |
recnum | character varying(255) |
pop_1990 | integer |
land_water | character varying(255) |
elevation | integer |
prog_disc | integer |
comboflag | character varying(255) |
sfips | integer |
zprog_disc | integer |
pl_fips | integer |
county_fip | integer |
population | integer |
watch_warn | integer |
name | character varying(255) |
st | character varying(255) |
lat | double precision |
st_fips | integer |
cfips | integer |
id | integer |
warngenlev | integer |
On the other hand, if I check every value, as in all(['list', 'of', 'everything'...]), I get a schema more like the first one. I can tolerate a bit of data loss here -- if the entry for some town is wrong and it doesn't significantly affect the population figures, etc.
I am only using an old package called dbview to pipe the .dbf files into these scripts -- I am not trying to map any of the format's native capability. I assumed that shp2pgsql would have picked the low-hanging fruit in that regard. Any suggestions for either dbview or another package are welcome -- although there are other cases where I may not be working with .dbf files and would need to find the best types anyway. I am also going to ask a question about PostgreSQL to see if I can find a solution at that level.
Upvotes: 1
Reputation: 1224
You can determine integers and floats unsafely with type(eval(elem)), where elem is an element of the list. (But then you need to check elem for possible bad code.)
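For illustration only (Python 3 output shown), with the caveat the next answer spells out:

>>> type(eval('24.2'))   # never do this on untrusted input
<class 'float'>
>>> type(eval('2'))
<class 'int'>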
A safer way could be to do the following:
a = ['24.2', '.2', '2']
try:
    if all(elem.isdigit() for elem in a):
        print("int")
    elif all(float(elem) for elem in a):
        print("float")
except:
    i = len(a[0])
    if all(len(elem) == i for elem in a):
        print("varchar(%s)" % i)
    else:
        print("n/a")
Upvotes: 1
Reputation: 46811
Don't use eval. If someone inserts bad code, it can hose your database or server.
Instead, use these:
def isFloat(s):
    try:
        float(s)
        return True
    except (ValueError, TypeError):
        return False
str.isdigit()
Everything else can be a varchar.
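For example, the two checks could be combined per column like this (a sketch; it allows a leading minus sign for the int test, and the sample values are just illustrations):

def column_type(values, default='varchar(255)'):
    stripped = [v.strip() for v in values if v.strip()]
    if not stripped:
        return default
    if all(v.lstrip('-').isdigit() for v in stripped):
        return 'int'
    if all(isFloat(v) for v in stripped):
        return 'float'
    return default

# column_type(['0', '3', '5'])                    -> 'int'
# column_type(['-81.009', '135.444', '-80.000'])  -> 'float'
# column_type(['ga', 'ca', 'tn'])                 -> 'varchar(255)'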
Upvotes: 5