Reputation: 5911
I have these tables:
create table person (
person_id int unsigned auto_increment,
person_key varchar(40) not null,
primary key (person_id),
constraint uc_person_key unique (person_key)
);
-- person_key is a varchar(40) that identifies an individual, unique
-- person in the initial data that is imported from a CSV file to this table
create table marathon (
marathon_id int unsigned auto_increment,
marathon_name varchar(60) not null,
primary key (marathon_id)
);
create table person_marathon (
person_marathon_id int unsigned auto_increment,
person_id int unsigned,
marathon_id int unsigned,
primary key (person_marathon_id),
foreign key (person_id) references person (person_id),
foreign key (marathon_id) references marathon (marathon_id),
constraint uc_marathon_person unique (person_id, marathon_id)
);
The person table is populated from a CSV that contains about 130,000 rows. This CSV contains a unique varchar(40) for each person and some other person data. There is no ID in the CSV.
For each marathon, I get a CSV that contains a list of 1k - 30k persons. The CSV contains essentially just a list of person_key values that show which people participated in that specific marathon.
What is the best way to import the data into the person_marathon table to maintain the FK relationship?
These are the ideas I can currently think of:
Pull the person_id + person_key information out of MySQL and merge the person_marathon data in PHP to get the person_id in there before inserting into the person_marathon table.
Use a temporary table for the insert... but this is for work, and I have been asked to never use temporary tables in this specific database.
Don't use a person_id at all and just use the person_key field, but then I would have to join on a varchar(40), and that's usually not a good thing.
Or, for the insert, make it look something like this:
insert into person_marathon (person_id, marathon_id)
select p.person_id, m.marathon_id
from ( select 'person_a' as p_name, 'marathon_a' as m_name union
select 'person_b' as p_name, 'marathon_a' as m_name )
as imported_marathon_person_list
join person p
on p.person_key = imported_marathon_person_list.p_name
join marathon m
on m.marathon_name = imported_marathon_person_list.m_name;
The problem with that insert is that to build it in PHP, the imported_marathon_person_list would be huge because it could easily be 30,000 select union items. I'm not sure how else to do it, though.
Upvotes: 2
Views: 2083
Reputation: 4813
I've dealt with similar data conversion problems, though at a smaller scale. If I'm understanding your problem correctly (which I'm not sure of), it sounds like the detail that makes your situation challenging is this: you're trying to do two things in the same step, namely import the data from the CSV files and convert it into its optimized form with integer foreign keys.
In a nutshell, I would do everything possible to avoid doing both of these things in the same step. Break it into two steps: import all the data first, in tolerable form, and optimize it later. MySQL is a good environment for this sort of transformation, because as you import the data into the persons and marathons tables, the IDs are set up for you.
Step 1: Importing the data
Once you're done importing the data, you have three tables:
* persons - you have the ugly personkey, as well as a newly generated person_id, plus any other fields
* marathons - you should have a marathon_id at this point, right? Either newly generated, or a number you've carried over from some older system.
* persons_marathons - this table should have marathon_id filled in and pointing to the correct row in the marathons table, right? You also have personkey (ugly but present) and person_id (which is still null).
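A minimal sketch of what Step 1 might look like in MySQL. It uses the question's singular table names (person_marathon, person_key) rather than the plural names above. The file path, the marathon_id value 42, and the CSV layout (one person_key per line, no header row) are assumptions for illustration only, and LOAD DATA LOCAL requires local_infile to be enabled on both client and server.

```sql
-- Assumption: keep the raw key next to the still-empty person_id
-- in the regular association table (no temporary table involved).
alter table person_marathon
  add column person_key varchar(40) null;

-- Assumption: the file holds one person_key per line, and a marathon
-- row with marathon_id = 42 already exists. Path and id are hypothetical.
load data local infile '/path/to/marathon_42.csv'
into table person_marathon
lines terminated by '\n'
(person_key)
set marathon_id = 42;
```

If the CSV was produced on Windows, the line terminator would likely need to be '\r\n' instead.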
Step 2: Use personkey to fill in person_id for each row in the association table
Then you either use straight MySQL, or write a simple PHP script, to fill in person_id for each row in the persons_marathons table. If I'm having trouble getting MySQL to do this directly, I'll often write a PHP script to deal with a single row at a time. The steps in this would be simple: pick a row whose person_id is still null, look up the person with the matching personkey, and write that person's person_id into the row.
You can tell PHP to repeat this 100 times and then end the script, or 1000 times, if you keep getting timeout problems or anything like that.
This transformation involves a huge number of lookups, but each lookup only needs to be for a single row. That's appealing because at no point do you need to ask MySQL (or PHP) to "hold the whole dataset in its head".
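As a sketch of the "straight MySQL" route for Step 2, the lookup can be written as a single multi-table update, with a batched variant for cases where timeouts are a concern. The table and column names follow the question's schema, and the helper column person_key on person_marathon is an assumption carried over from the import step; the batch size of 1000 is arbitrary.

```sql
-- Fill in person_id for every imported row in one statement.
update person_marathon pm
join person p on p.person_key = pm.person_key
set pm.person_id = p.person_id
where pm.person_id is null;

-- Batched alternative: run repeatedly until it affects zero rows.
-- Rows whose key has no match in person are skipped, so the loop ends.
update person_marathon
set person_id = (select p.person_id
                 from person p
                 where p.person_key = person_marathon.person_key)
where person_id is null
  and person_key in (select person_key from person)
limit 1000;
```

The batched form uses the single-table update syntax because MySQL does not allow LIMIT on a multi-table update. Either form will fail with a duplicate-key error if the same person appears twice in one marathon's CSV, since (person_id, marathon_id) is unique.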
At this point, your associations table should have person_id filled in for every row. It's now safe to delete the personkey column, and voila, you have your efficient foreign keys.
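Before dropping the key column, it may be worth checking that nothing was left unmatched. A minimal sketch, again assuming the question's table names and the hypothetical person_key helper column on person_marathon:

```sql
-- Any rows counted here had a person_key with no match in person.
select count(*) as unmatched_rows
from person_marathon
where person_id is null;

-- Once the count is zero, the helper column is no longer needed.
alter table person_marathon
  drop column person_key;
```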
Upvotes: 2