Reputation: 153
In my database, I have cities and states that (appear to be) are stored from user input. I am trying to get only distinct locations, so my query looks like this:
SELECT Distinct C.City, C.State FROM Customers C
The problem I ran into is that some cities have typos, so in my results there could be "Dallas" TX, "Dalas" TX, "Dallas," TX, and so on. Is there a way to filter for the "correct" spelling without having to sort through ten thousand rows?
Upvotes: 0
Views: 894
Reputation: 7180
You're design should be addressed, storing free text cities and states tends to produce issues...well, like this. If you can address the design, do so...if you are forced to work with this, then I'm afraid you are in for some manual fun.
Create a table as city,state,equivlent_city, equivlent_state
Select all distinct values from city/state from your table and export them into a spreadsheet (copy and paste works too). Go through each line in this spreadsheet and assign the proper spelling to each city state that you want it to appear under...ya, this can be a long tedious process (last time I did this, I delegated this fantastic task to the summer students). When you're done, import the spreadsheet back into your database as a table. Anytime you need to refer to the city/state from your fact table, join to this translation table on city and state, then refer to equivalent_city and equivalent_state in youre queries.
Advantage to this method is this translation table grows as your users find new and interesting ways to spells cities (you will need a process to continue to update this table with new spellings)....there is an upper limit on screwy spellings out there and you'll eventually capture the majority of them. Disadvantage is in maintaining the setup (and the disbelief that even though you thought you had all spellings of Dallas tx captured, some user will input 'Daalass tex').
Alternative is stated in the comments, better design prevents this horrible solution from being needed.
Upvotes: 0
Reputation: 2475
Well, this seems like a huge problem. If you have ZIP codes, you can enrich the data. But judging by your question, this is not the case.
I think you might import a table with unique states and cities and join this table to your table. The non-matching rows might be checked and updated. Still an awful job though.
Good luck.
Upvotes: 1