Optimizing tables that already have duplicate entries

Question

Currently I have raw data that is formatted like this

             ---------------------------------------------------------------------
            | User           Number           charges        cost     account#   |
            | John Smith    555-555-5555      Rental         $700      12345     |
            | John Smith    555-555-5555      Phone          $100      12345     |
            | Mike Jones    444-444-4444      Late Pay       $50       98765     |
            | John Smith    555-555-5555      Surcharge      $100      12345     |
            ----------------------------------------------------------------------

This is just an example that I threw together, but the setup is close to the same in the aspect that the raw data doesn't have things grouped appropriately. As you can see the breakdown lists John Smith and his info, then jumps to Mike Jones, and then back to John Smith, creating many duplicate entries in the user, number, and account# column. The real data file that I have, is set up in this way, but consists of thousands of rows and 10-20 columns. I've been trying to find a way to break the username and number off into a separate table in which i could create a relationship with the rest of the file in order to help optimize the table. I realize I could do this if I typed in each entry manually, but that would obviously take way to much time and wouldn't be feasible each time one of these data files needs to be uploaded to the database.

Originally I thought I would just create a user name table that would hold the name and number of the user, and for each user I would have an auto incremented key that would link to the rest of the table. The issue I ran into was that in the case of someone like John Smith I would have a table that looks like:

               --------------------------------------
              | ID        User       Number         |
              | 1        John Smith  555-555-5555   |
              | 2        Mike Jones  444-444-4444   |
              ---------------------------------------

And I would have to go in and enter a foreign key of 1 for every instance of his name in the main table:

             -----------------------------------------------------
            | ID        Charges        Cost          account#    |
            | 1         Rental         $700          12345       |
            | 1         Phone          $100          12345       | 
            | 2         Late Pay       $50           98765       |
            | 1         Surcharge      $100          12345       |
           --------------------------------------------------------

IS there a way to connect the tables without going in and manually entering in 1 every time the charges table has something related to John Smith?

Daileyo · Accepted Answer

You could do this with a third table that works as a link between the two.

Using your example, I'd consider doing something like:

user_table

   -----------------------------------------
  |    User       Number         acccount# |
  |   John Smith  555-555-5555   12345     |
  |   Mike Jones  444-444-4444   98765     |
  ------------------------------------------

order_table

     ------------------------------------------
    | Order#        Charges        Cost       |
    | 1             Rental         $700       |
    | 2             Phone          $100       | 
    | 3             Late Pay       $50        |
    | 4             Surcharge      $100       |
     ------------------------------------------

user_orders_table

     ----------------------------
    | account#        order#    |
    | 12345               1     |
    | 12345               2     | 
    | 98765               3     |
    | 12345               4     |
     ----------------------------

Order# would be auto-incremented by your database. Account number is assumed to be something unique by design, so it could serve as your primary key on the user_table. And then the user_orders_table would be the means of storing the link between the two tables.

Update:

If you need to keep the database managed identifier for the user table, then the example could be adjusted as follows:

user_table

   -------------------------------------------------
  | user_uid   User       Number         acccount# |
  | 1          John Smith  555-555-5555   12345    |
  | 2          Mike Jones  444-444-4444   98765    |
  --------------------------------------------------

order_table

     ------------------------------------------
    | Order#        Charges        Cost       |
    | 1             Rental         $700       |
    | 2             Phone          $100       | 
    | 3             Late Pay       $50        |
    | 4             Surcharge      $100       |
     ------------------------------------------

user_orders_table

     ----------------------------
    | user_id         order#    |
    | 1               1         |
    | 1               2         | 
    | 2               3         |
    | 1               4         |
     ----------------------------

UPDATE 2

One way you could implement this would be using a procedure to do your inserts. The goal of using the stored procedure would be to implement the steps mentioned in the comments.

Insert new order

1.  Check for user

   1.a  User exists, get their id

   1.b  User does not exits.  Insert them and get their id.

2.  Create new order record

3.  Create user_orders record

Please treat this as pseudo code.

I don't have access to a MySQL database to test this... and it is really just intended to give a little more detailed theoretical example of how to implement.

DELIMITER //
CREATE PROCEDURE InsertOrder(
    IN 
        p_user_name VARCHAR(50), 
        p_user_phone_number VARCHAR(13), 
        p_user_account_number VARCHAR(50),
        p_order_charges VARCHAR(15), 
        p_order_cost MEDINT
)
 BEGIN
    //check for the user in the database. If they exist, get their id.  If they don't, insert them
    IF EXISTS (SELECT user_uid FROM USERS_TABLE WHERE user_name = p_user_name) //user exists... get their uid
        r_user_uid = user_uid;
    ELSE  //user doesn't exist... create the sucker
        INSERT INTO USER_TABLE (user_name, user_phone_number, user_account_number) 
                    VAULES(p_user_name, p_user_phone_number, p_user_account_number); //assuming that user_uid is an auto_incremented value
        r_user_uid = SELECT user_uid FROM USERS_TABLE WHERE user_name = p_user_name;
    END IF
    //insert your order information NOTE:  could be prudent to check if the order exists already... but we will assume it doesn't for the sake of the pseudo code example.
    INSERT INTO ORDER_TABLE (order_charges, order_cost) 
        VALUES (p_order_charges, p_order_cost); --assuming that order number is an auto_incremented value.
        r_order_id = SELECT max(order_number) FROM ORDER_TABLE //assumes the latest entry is the correct one... may not be the most robust way to do this... depending on how this is to be used overall.
    //Now create the user_orders record
    INSERT INTO USER_ORDERS_TABLE (user_id, order_number) VALUES (r_user_uid, r_order_number);
 END //
DELIMITER ;

PLEASE NOTE: In additional comments, after the original question, you mentioned that there is no real unique data for the user in your data sets. This can be a real problem for you, in regards to implementing a solution like this. To have a good table of users, you need a way for the data itself to be unique. Relying solely on database generated identifiers is not ideal... and not recommended. For example, if you want to enter

User      Number             account#
John Doe  555-555-5555       12345
John Doe  555-444-3333       12345
John Doe  555-555-5555       12345

Is this a list of two different users with the same name; or the same user, that has change their phone number? In order to create a good table of users, you must be able to determine this from your data that you are going to insert. As such, there would need to be some additional piece to uniquely identify the user.

You may have other data in the remaining columns that you omitted to help with this task... but if not... then creating a user table may not be the best solution for normalizing your data.

I hope that this helps.

Optimizing tables that already have duplicate entries

Answers (2)

Related Questions