Reputation: 23
I have a number of different CSV files that need to be processed hourly into different tables of a MySQL database. I am writing an ingestion engine in Python that takes an XML mapping file that looks like:
<table name="example_table">
    <column name="test">
        <source type="sql">schema.table_name.column_name</source>
    </column>
    <column name="test_2">
        <source type="csv">column_name</source>
    </column>
</table>
And uses this to determine where to insert the data into the MySQL database. In this example:
- Find column "test" in table "example_table" and fill it with data from another SQL table, "schema.table_name.column_name" (this is usually a primary key of another table).
- Find column "test_2" and fill it with data from the CSV file, keyed by "column_name".
Am I way off base here? Does this seem like a reasonable way to do it? The goal is to have one python engine and multiple xml mapping files so I can process each group of inserts effectively.
Is there a better method to go about it?
Upvotes: 1
Views: 1302
Reputation: 22473
There's nothing inherently wrong with this scheme. XML is a little verbose and a bit overkill for such simple mappings, but it's highly standardized, easily edited, and will work just fine. You should be able to loop through this structure easily:
from lxml import objectify

table = objectify.fromstring(xml_source)
print("table name:", table.attrib['name'])
for col in table.column:
    print("column name:", col.attrib['name'])
    print("source type:", col.source.attrib['type'])
    print("source col:", col.source.text)
Many developers now prefer JSON or YAML configuration files to XML. For example, a similar mapping in JSON:
{
    "table": "example_table",
    "columns": [
        {
            "name": "test",
            "source_type": "sql",
            "source_col": "schema.table_name.column_name"
        },
        {
            "name": "test_2",
            "source_type": "csv",
            "source_col": "column_name"
        }
    ]
}
Which you could also iterate through easily:
import json

j = json.loads(json_source)
print("table name:", j['table'])
for col in j['columns']:
    print("column name:", col['name'])
    print("source type:", col['source_type'])
    print("source col:", col['source_col'])
Whichever format you choose, data-driven recipes like this are a flexible way to feed your ingestion engine.
Upvotes: 1