gloomy.penguin
gloomy.penguin

Reputation: 5911

How to import "a lot" of data to MySQL with PHP and foreign keys?

I have these tables:

create table person (
    person_id int unsigned auto_increment, 
    person_key varchar(40) not null, 
    primary key (person_id), 
    constraint uc_person_key unique (person_key)
) 
-- person_key is a varchar(40) that identifies an individual, unique 
-- person in the initial data that is imported from a CSV file to this table

create table marathon (
    marathon_id int unsigned auto_increment,  
    marathon_name varchar(60) not null, 
    primary key (marathon_id) 
)

create table person_marathon (
    person_marathon _id int unsigned auto_increment,  

    person_id int unsigned, 
    marathon_id int unsigned,

    primary key (person_marathon_id),
    foreign key person_id references person (person_id), 
    foreign key marathon_id references person (marathon_id),

    constraint uc_marathon_person unique (person_id, marathon_id)  
)

Person table is populated by a CSV that contains about 130,000 rows. This CSV contains a unique varchar(40) for each person and some other person data. There is no ID in the CSV.

For each marathon, I get a CSV that contains a list of 1k - 30k persons. The CSV contains essentially just a list of person_key values that show which people participated in that specific marathon.

What is the best way to import the data into the person_marathon table to maintain the FK relationship?

These are the ideas I can currently think of:

Upvotes: 2

Views: 2083

Answers (1)

Topher Hunt
Topher Hunt

Reputation: 4813

I've dealt with similar data conversion problems, though at a smaller scale. If I'm understanding your problem correctly (which I'm not sure of), it sounds like the detail that makes your situation challenging is this: you're trying to do two things in the same step:

  • import a large number of rows from CSV into mysql, and
  • do a transformation such that the person-marathon associations work through person_id and marathon_id, rather than the (unwieldy and undesirable) varchar personkey column.

In a nutshell, I would do everything possible to avoid doing both of these things in the same step. Break it into those two steps - import all the data first, in tolerable form, and optimize it later. Mysql is a good environment to do this sort of transformation, because as you import the data into the persons and marathons tables, the IDs are set up for you.

Step 1: Importing the data

  • I find data conversions easier to perform in a mysql environment than outside of it. So get the data into mysql, in a form that preserves the person-marathon associations even if it isn't optimal, and worry about changing the association approach afterwards.
  • You mention temp tables, but I don't think you need any. Set up a temporary column "personkey", on the persons_marathons table. When you import all the associations, you'll leave person_id blank for now, and just import personkey. Importantly, ensure that personkey is an indexed column both on the associations table and on the persons table. Then you can go through later and fill in the correct person_id for each personkey, without worrying about mysql being inefficient.
  • I'm not clear on the nature of the marathons table data. Do you have thousands of marathons to enter? If so, I don't envy you the work of handling 1 spreadsheet per marathon. But if it's fewer, then you can perhaps set up the marathons table by hand. Let mysql generate marathon IDs for you. Then as you import the person_marathon CSV for each marathon, be sure to specify that marathon ID in each association relevant to that marathon.

Once you're done importing the data, you have three tables: * persons - you have the ugly personkey, as well as a newly generated person_id, plus any other fields * marathons - you should have a marathon_id at this point, right? either newly generated, or a number you've carried over from some older system. * persons_marathons - this table should have marathon_id filled in & pointing to the correct row in the marathons table, right? You also have personkey (ugly but present) and person_id (which is still null).

Step 2: Use personkey to fill in person_id for each row in the association table

Then you either use straight Mysql, or write a simple PHP script, to fill in person_id for each row in the persons_marathons table. If I'm having trouble getting mysql to do this directly, I'll often write a php script to deal with a single row at a time. The steps in this would be simple:

  1. look up any 1 row where person_id is null but personkey is not null
  2. look up that personkey's person_id
  3. write that person_id in the associations table for that row

You can tell PHP to repeat this 100 times then end script, or 1000 times, if you keep getting timeout problems or anything like taht.

This transformation involves a huge number of lookups, but each lookup only needs to be for a single row. That's appealing because at no point do you need to ask mysql (or PHP) to "hold the whole dataset in its head".

At this point, your associations table should have person_id filled in for every row. It's now safe to delete the personkey column, and voila, you have your efficient foreign keys.

Upvotes: 2

Related Questions