Bo Ye

Reputation: 401

How does the performance of multi-threaded inserts (updates) on a database compare with single-threaded sequential inserts (updates)?

Let's imagine an environment: there is one db client and one db server. The db client could be a Java program or something else; the db server could be MySQL, Oracle, etc.

The requirement is inserting a large amount of records into one table on the db server.

The simplest way is having a loop inside which the client inserts one record each time until all records are inserted. This is single threaded sequential inserts.

There is another, multi-threaded, concurrent approach in which the client fires up multiple threads simultaneously, each of which inserts one record into the table. Intuitively, because those records are independent and there is an assumption that modern db servers come with RAID in which the concurrent IO is well-supported, the inserts seem to be able to gain practical and true concurrency, so this approach could improve performance compared to the above method.

However, as soon as I dived into the details, it turned out that's probably not the case. This link -- Multi threaded insert using ORM? -- says inserts on the same table require a lock on the whole table for every single write. Therefore, every insert just blocks the following insert; in the end, this way is just another form of sequential inserts, with no performance gain at all.

My questions are as follows:

  1. Why do most DBs treat multi-threaded inserts on the same table this way?
  2. Why is the insert lock on the whole table compulsory?
  3. Are multi-threaded updates treated similarly to multi-threaded inserts?

Although it seems that the best way to deal with a large number of inserts is to enable batch inserts, I'm still very curious about the grounds for locking the whole table while inserts happen.

Thanks in advance!

=====================================================================

After plenty of reading and research, it turns out that my questions were based on a wrong premise. The reality is that one insert doesn't block another insert happening at the same time (at least that is true for Oracle).

Upvotes: 2

Views: 5229

Answers (3)

Bo Ye

Reputation: 401

Nothing is better than writing a demo to prove a theory.

I've worked out the following demo to compare the performance of single-threaded sequential inserts (without batching) and multi-threaded inserts against Oracle.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Compares single-threaded row-by-row inserts with multi-threaded inserts against Oracle.
 * Created by boy on 14/03/17.
 */
public class DatabasePerformanceTest {
    // constants
    private static final String DBURL = "jdbc:oracle:thin:@xxxxxxxx:1521:xxxxx";
    private static final String DBUSER = "xxx";
    private static final String DBPASS = "xxxx";
    private static final int INSERT_AMOUNT = 10000;
    private static final String INSERT_PERSON = "insert into Persons values(1, 'xx', 'xx', 'xxxxxxx', 'xxxxxxx')";
    // pools
    private DataSource ds;
    private ExecutorService executor;

    public static void main(String[] args) throws SQLException, InterruptedException {
        DatabasePerformanceTest test = new DatabasePerformanceTest();
        test.setUp();
        long begin = System.currentTimeMillis();
        //test.insertByRowByRow();
        test.insertByMultipleThreads();
        long end = System.currentTimeMillis();
        System.out.println("Time spent:" + (end - begin) + "ms");
    }

    private void setUp() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(DBURL);
        config.setUsername(DBUSER);
        config.setPassword(DBPASS);
        config.setDriverClassName("oracle.jdbc.driver.OracleDriver"); // optional when jdbcUrl is set
        config.addDataSourceProperty("cachePrepStmts", "true");
        config.addDataSourceProperty("prepStmtCacheSize", "250");
        config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
        ds = new HikariDataSource(config);
        this.executor = Executors.newFixedThreadPool(128);
    }

    // Inserts a single row, then closes the statement and returns the connection to the pool.
    private void insertOnePerson(Connection connection) throws SQLException {
        Statement statement = null;
        try {
            statement = connection.createStatement();
            statement.execute(INSERT_PERSON);
        } finally {
            try {
                if (statement != null) {
                    statement.close();
                }
                if (connection != null) {
                    connection.close();
                }
            } catch (SQLException e) {
                e.printStackTrace();
            }
        }
        System.out.println("Inserting one person is done.");
    }

    // Single-threaded sequential inserts: one row per round trip, no batching.
    private void insertByRowByRow() throws SQLException {
        for (int i = 0; i < INSERT_AMOUNT; i++) {
            this.insertOnePerson(ds.getConnection());
        }
    }

    // Multi-threaded inserts: each task takes a pooled connection and inserts one row.
    private void insertByMultipleThreads() throws InterruptedException {
        for (int i = 0; i < INSERT_AMOUNT; i++) {
            executor.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        insertOnePerson(ds.getConnection());
                    } catch (SQLException e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    }
}

After testing, it clearly showed that the multi-threaded inserts are about 4 times faster than the single-threaded serialised inserts (without batching).

Therefore, the first answer at the link -- Multi threaded insert using ORM? -- is wrong.

That being said, as BobC mentioned, the above approach is a "home-grown" way; the best approach for dealing with a large number of inserts is batch inserts (loading sets of rows).

Upvotes: 3

Gray

Reputation: 116828

The simplest way is having a loop inside which the client inserts one record each time until all records are inserted. This is single threaded sequential inserts.

Even with single threaded operations, suspending auto-commit (which may be accomplished by starting a transaction), inserting a number of entries in a batch, and then committing the changes is vastly more efficient than doing inserts 1-by-1.
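
To make that concrete, here is a minimal sketch of such a batched, single-transaction load in JDBC. It assumes a five-column Persons table like the one in the question's demo; the batch size of 1,000 and the column values are arbitrary assumptions, not recommendations (imports needed: javax.sql.DataSource, java.sql.Connection, java.sql.PreparedStatement, java.sql.SQLException).

// Minimal sketch of a batched, single-transaction load against an assumed
// Persons(id, firstName, lastName, address, city) table.
static void insertByBatch(DataSource ds, int rowCount) throws SQLException {
    try (Connection conn = ds.getConnection();
         PreparedStatement ps = conn.prepareStatement(
                 "insert into Persons values(?, ?, ?, ?, ?)")) {
        conn.setAutoCommit(false);          // suspend auto-commit: one transaction for the whole load
        for (int i = 0; i < rowCount; i++) {
            ps.setInt(1, i);
            ps.setString(2, "xx");
            ps.setString(3, "xx");
            ps.setString(4, "xxxxxxx");
            ps.setString(5, "xxxxxxx");
            ps.addBatch();                  // queue the row client-side instead of one round trip per row
            if (i % 1000 == 999) {
                ps.executeBatch();          // send 1,000 rows to the server in a single round trip
            }
        }
        ps.executeBatch();                  // flush any remaining rows
        conn.commit();                      // commit once at the end
    }
}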

... there is an assumption that modern db servers come with RAID in which the concurrent IO is well-supported

In reality, there may be no such thing as concurrent IO at the hardware level. IO requests may be serialized just like packets in a network interface, even though we think of there being multiple concurrent connections to a server. However, it is the queuing up of the IO requests by multiple application threads which maximizes the IO bus.

As an aside, RAID is usually also serial IO and can often be even slower than single devices – especially when we are talking about writes. RAID5 is so slow, for example, that most high performance clusters use RAID50 to try to get performance to adequate levels.

Why do most DBs treat multi-threaded inserts on the same table this way?

This is highly dependent on the database type and may have to do with how it keeps the tables organized. Most inserts write to the end of a data table (or fight for the empty rows), so multiple threads would be contending for the same disk space, which makes sequencing the inserts efficient.

Why is the insert lock on the whole table compulsory?

It's not. Again, this depends highly on how the database implements the inserts.

Are multi-threaded updates treated similarly to multi-threaded inserts?

I don't think so. Updates happen at different places in the table, although regional locks might be used, and certainly locks on indexes will come into play if indexed fields are updated.

Your question really should be "how do I maximize my insert bandwidth". I (and others) have mentioned batching up the inserts as the first step. You also need to make sure you are using database connection pooling – this is also important with single-threaded db operations. Pooling means both that you can use multiple connections concurrently and that you don't have to create a connection for each database transaction. There are many database connection pooling libraries out there. We use HikariCP.
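
As a rough sketch of combining those two suggestions (pooling plus batching), the load could be split into chunks, with each worker thread drawing one pooled connection and batching its own chunk. The worker count, chunk size, and Persons columns below are assumptions for illustration, not measured recommendations:

// Rough sketch combining pooling and batching: a handful of worker threads,
// each drawing its own pooled connection and batching its share of the rows.
static void insertByPooledBatches(final DataSource ds, final int rowCount) throws InterruptedException {
    final int workers = 8;                              // assumed thread count
    final int rowsPerWorker = rowCount / workers;
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    for (int w = 0; w < workers; w++) {
        final int offset = w * rowsPerWorker;
        pool.execute(new Runnable() {
            @Override
            public void run() {
                try (Connection conn = ds.getConnection();
                     PreparedStatement ps = conn.prepareStatement(
                             "insert into Persons values(?, ?, ?, ?, ?)")) {
                    conn.setAutoCommit(false);
                    for (int i = offset; i < offset + rowsPerWorker; i++) {
                        ps.setInt(1, i);
                        ps.setString(2, "xx");
                        ps.setString(3, "xx");
                        ps.setString(4, "xxxxxxx");
                        ps.setString(5, "xxxxxxx");
                        ps.addBatch();
                    }
                    ps.executeBatch();                  // one round trip per worker's chunk
                    conn.commit();
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
}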

Hope this helps.

Upvotes: 1

BobC

Reputation: 4416

This answer requires an understanding of the database that is beyond the scope of a simple answer here. Since you ask about Oracle:

Oracle does not lock the whole table in the way you think it does. During an insert, there is a lock on what is essentially the table structure (i.e. so someone cannot drop a column mid-insert), but at the data level, there is no lock. This means that you can have many concurrent inserts on a single table. Updates (in Oracle) are similar. In this case, however, there is a row lock on the data being updated. So you can have many concurrent updates on the same table, but not on the same row.
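
As a rough illustration of that row-level behaviour (a sketch under assumptions, not part of the answer: it borrows a DataSource named ds, as in the demo above, and assumes the Persons table has PersonID and City columns):

// Sketch: two pooled connections acting as two sessions on the same table.
// The PersonID and City column names are assumptions for illustration.
try (Connection session1 = ds.getConnection();
     Connection session2 = ds.getConnection();
     Statement s1 = session1.createStatement();
     Statement s2 = session2.createStatement()) {
    session1.setAutoCommit(false);
    session2.setAutoCommit(false);

    // Session 1 updates row PersonID = 1 and holds that row's lock until commit.
    s1.executeUpdate("update Persons set City = 'xxxx' where PersonID = 1");

    // Session 2 can update a different row of the same table straight away.
    s2.executeUpdate("update Persons set City = 'yyyy' where PersonID = 2");

    // This, by contrast, would block until session 1 commits or rolls back,
    // because it targets the row session 1 has locked:
    // s2.executeUpdate("update Persons set City = 'yyyy' where PersonID = 1");

    session1.commit();
    session2.commit();
}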

Having said all that, multi-threaded inserting is not the way to load large amounts of data. For this, Oracle provides an alternative approach: a direct path load. In this approach, we load sets of rows, rather than row-by-row (slow-by-slow). It's not that a single insert is slow; quite the opposite, they are very fast. But even at 0.1ms per insert, when you have to load 100M rows, that's 2.7 hours! A set-based approach allows the database to perform the parallelism, rather than relying on the manual "home grown" multi-threaded approach. To give you an idea of what can be done, I just loaded about 6 billion rows (about 1 TB of data) in around 10 minutes. Finally, data loading is typically CPU constrained, not IO constrained.
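
For a flavour of the set-based alternative, here is a minimal JDBC sketch; the staging_persons table is purely an assumption, and the APPEND hint merely asks Oracle for a direct-path insert (SQL*Loader and external tables are other common routes):

// Hypothetical set-based load: one statement moves every row from an assumed
// staging_persons table into Persons; the APPEND hint requests a direct-path insert.
try (Connection conn = ds.getConnection();
     Statement stmt = conn.createStatement()) {
    conn.setAutoCommit(false);
    stmt.executeUpdate(
            "insert /*+ APPEND */ into Persons select * from staging_persons");
    conn.commit();   // a direct-path insert must be committed before the same
                     // session can query or modify the table again
}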

Upvotes: 5
