Reputation: 2046
I've been assigned a task for our web app where I need to take genres of differing names (Horror, Comedy, and so on) coming in from various sources (Facebook Opengraph, XML Feeds, etc.) and reduce the data down to a "master" table.
An example of this is we have a genre called "Action", and one of our feeds has a genre dubbed "Action/Adventure". Instead, I'd rather the "Action/Adventure" films be assigned the "Action" master genre.
I was thinking of writing a hard-coded hash map. We use something like this with languages:
languages = { "en_US" => "English", "en_GB" => "English" }
Does anyone know of a better way? Perhaps I should rely on a look-up table in the database? Cheers!
Upvotes: 2
Views: 114
Reputation: 47392
I think that it depends on what you're using to pull in all of this data. Randy is correct in that you should probably preserve the original data in some way, although that doesn't have to be in your production database - it could be in some interim form, which might be text files or another staging database.
At the risk of going slightly off-topic...
Usually when I'm performing complex ETL from multiple sources I use a two-step process. The first step is to consolidate all of the inputs into one format. This might be in CSV or XML files or it might be in a staging database.
From there I have a second process that loads the data into the production or master database. The advantage here is that once you have your import (or "load") code complete, with all of the potentially complicated business logic, etc. you very rarely have to touch it (and potentially break it) again. If a new data source comes online you just have to write a new bit of code to get it into your universal format. Once it's in that format, you know that your import process will handle it correctly.
Again though, how you translate (or "transform") data will depend on what you're using for the heavy lifting in your ETL system. If you're using a staging database then it makes sense to put it into a table. You usually want to do this transformation when going from the raw data to the universal format.
Upvotes: 2
Reputation: 16673
i think the trick here is to make sure you preserve the original designation, along with the source identifier. You may then create a map between the source designation and the target preferred designation, and finally decide whether to convert this once statically, or in queries/views dynamically.
Upvotes: 2