Pioz
Pioz

Reputation: 6321

MySQL bulk insert on multiple tables

I have a MySQL database with 2 tables products and product_variants. A product has many product variants. Here a sample:

products
+----+------+
| id | name |
+----+------+
|  1 | Foo  |
|  2 | Bar  |
+----+------+

product_variants
+----+-------------+--------+
| id | product_id  | value  |
+----+-------------+--------+
| 10 |           1 | red    |
| 11 |           1 | green  |
| 12 |           1 | blue   |
| 13 |           2 | red    |
| 14 |           2 | yellow |
+----+-------------+--------+

Now I need to bulk insert a lot of products with their variants in the most efficient and fastest way. I have a JSON with many products (100k+) like this:

[
  {
    "name": "Foo",
    "variants": [{ "value": "red" }, { "value": "green" }, { "value": "blue" }]
  },
  {
    "name": "Bar",
    "variants": [{ "value": "red" }, { "value": "yellow" }]
  },
  ...
]

from which I should generate a query to insert the products.

My idea is to use an insert query like this:

INSERT INTO `products` (name) VALUES ("foo"), ("bar"), ...;

But then I don't know what product_id (foreign key) to use in the insert query for product_variants:

INSERT INTO `product_variants` (product_id,value) VALUES (?,"red"), (?,"green"), ...;

(these queries inside a transaction)

I've thought to specify the product ids manually, in increment way from the last id, but I get errors when concurrency connection insert products at the same time or when 2 or more bulk insert processes run concurrently.

What strategy can I use to achieve my goal? Is there a standard way to do this?

ps: if possible I would not want to change the structure of the 2 tables.

Upvotes: 4

Views: 2885

Answers (3)

Pioz
Pioz

Reputation: 6321

Finally, I've used a strategy that uses the MySQL function LAST_INSERT_ID() like @sticky-bit sad but using bulk insert (1 insert for many products) that is much faster.

I attach a simple Ruby script to perform bulk insertions. All seems works well also with concurrency insertions.

I've run the script with the flag innodb_autoinc_lock_mode = 2 and all seems good, but I don't know if is necessary to set the flag to 1:

require 'active_record'
require 'benchmark'
require 'mysql2'
require 'securerandom'

ActiveRecord::Base.establish_connection(
  adapter:  'mysql2',
  host:     'localhost',
  username: 'root',
  database: 'test',
  pool:     200
)

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
end

class Product < ApplicationRecord
  has_many :product_variants
end

class ProductVariant < ApplicationRecord
  belongs_to :product
  COLORS = %w[red blue green yellow pink orange].freeze
end

def migrate
  ActiveRecord::Schema.define do
    create_table(:products) do |t|
      t.string :name
    end

    create_table(:product_variants) do |t|
      t.references :product, null: false, foreign_key: true
      t.string :color
    end
  end
end

def generate_data
  d = []
  100_000.times do
    d << {
      name: SecureRandom.alphanumeric(8),
      product_variants: Array.new(rand(1..3)).map do
        { color: ProductVariant::COLORS.sample }
      end
    }
  end
  d
end

DATA = generate_data.freeze

def bulk_insert
  # All inside a transaction
  ActiveRecord::Base.transaction do
    # Insert products
    values = DATA.map { |row| "('#{row[:name]}')" }.join(',')
    q = "INSERT INTO products (name) VALUES #{values}"
    ActiveRecord::Base.connection.execute(q)

    # Get last insert id
    q = 'SELECT LAST_INSERT_ID()'
    last_id, = ActiveRecord::Base.connection.execute(q).first

    # Insert product variants
    i = -1
    values = DATA.map do |row|
      i += 1
      row[:product_variants].map { |subrow| "(#{last_id + i},'#{subrow[:color]}')" }
    end.flatten.join(',')
    q = "INSERT INTO product_variants (product_id,color) VALUES #{values}"
    ActiveRecord::Base.connection.execute(q)
  end
end

migrate

threads = []

# Spawn 100 threads that perform 200 single inserts each
100.times do
  threads << Thread.new do
    200.times do
      Product.create(name: 'CONCURRENCY NOISE')
    end
  end
end

threads << Thread.new do
  Benchmark.bm do |benchmark|
    benchmark.report('Bulk') do
      bulk_insert
    end
  end
end

threads.map(&:join)

After running the script I've checked that all products have associated variants with the query

SELECT * 
FROM products
 LEFT OUTER JOIN product_variants
 ON (products.id = product_variants.product_id)
WHERE product_variants.product_id IS NULL
AND name != "CONCURRENCY NOISE";

and correctly I get no rows.

Upvotes: 0

Rick James
Rick James

Reputation: 142298

If you already have the JSON in a table, then it can probably be done (quite efficiently) with two statements:

INSERT INTO Products (name)
    SELECT name
        FROM origial_table;  -- to get the product names

INSERT INTO Variants (product_id, `value`)
    SELECT  ( SELECT id FROM Products WHERE name = ot.name ),
            `value`
        FROM origial_table AS ot;

In reality, name and value would need to be suitable JSON expressions to extract the values.

If you are concerned about lots of duplicate "products" in the first table, be sure to have UNIQUE(name). And you can avoid "burning" ids by the 2-step process described here: mysql.rjweb.org/doc.php/staging_table#normalization

Upvotes: 0

sticky bit
sticky bit

Reputation: 37472

You can use the last_insert_id() to get the last generated ID from the last statement. But since this, as mentioned, only gets the last ID of the statement, requires that you handle each product individually. You can bulk insert the variants though. But from the structure of the given JSON I'd think that makes it even easier to traverse that JSON. Each product and it's variant should be inserted in an transaction so that the variants of a product don't get added to the previous product if the INSERT into the product table fails for some reason.

START TRANSACTION;
INSERT INTO products
            (name)
            VALUES ('Foo');
INSERT INTO product_variants
            (product_id,
             value)
            VALUES (last_insert_id(),
                    'red'),
                   (last_insert_id(),
                    'green'),
                   (last_insert_id(),
                    'blue');
COMMIT;

START TRANSACTION;
INSERT INTO products
            (name)
            VALUES ('Bar');
INSERT INTO product_variants
            (product_id,
             value)
            VALUES (last_insert_id(),
                    'red'),
                   (last_insert_id(),
                    'yellow');
COMMIT;

db<>fiddle

Upvotes: 2

Related Questions