natuslaedo

Reputation: 429

Hibernate performance issue while inserting massive data

We will migrate a large amount of data (a single type of entity) from Amazon's DynamoDB into a MySQL DB. We are using Hibernate to map this class to a MySQL entity. There are around 3 million entities (excluding the rows of the list property). Here is our class mapping summary:

@Entity
@Table(name = "CUSTOMER")
public class Customer {
    @Id
    @Column(name = "id")
    private String id;

    //Other properties in which all of them are primitive types/String

    @ElementCollection
    @CollectionTable(name = "CUSTOMER_USER", joinColumns = @JoinColumn(name = "customer_id"))
    @Column(name = "userId")
    private List<String> users;

    // CONSTRUCTORS, GETTERS, SETTERS, etc.
}

users is a list of Strings. We have created two MySQL tables as follows:

CREATE TABLE CUSTOMER(id VARCHAR(100), PRIMARY KEY(id));
CREATE TABLE CUSTOMER_USER(customer_id VARCHAR(100), userId VARCHAR(100), PRIMARY KEY(customer_id, userId), FOREIGN KEY (customer_id) REFERENCES CUSTOMER(id));

Note: We do not make Hibernate generate any id values; we assign our own IDs to the Customer entities, and they are guaranteed to be unique.

Here is our hibernate.cfg.xml:

<hibernate-configuration>
    <session-factory>
        <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
        <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
        <property name="hibernate.connection.url">jdbc:mysql://localhost/xxx</property>
        <property name="hibernate.connection.username">xxx</property>
        <property name="hibernate.connection.password">xxx</property>
        <property name="hibernate.connection.provider_class">org.hibernate.c3p0.internal.C3P0ConnectionProvider</property>
        <property name="hibernate.jdbc.batch_size">50</property>
        <property name="hibernate.cache.use_second_level_cache">false</property>
        <property name="c3p0.min_size">30</property>
        <property name="c3p0.max_size">70</property>
    </session-factory>
</hibernate-configuration>

We create a number of threads, each reading data from DynamoDB and inserting it into our MySQL DB via Hibernate. Here is what each thread does:

// Each single thread brings resultItems from DynamoDB
Session session = factory.openSession();
Transaction tx = session.beginTransaction();
for(int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    session.save(cust);
    if(i % BATCH_SIZE == 0) {
        session.flush();
        session.clear();
    }
}
tx.commit();
session.close();

We have our own performance monitoring functions and we continuously log the overall read/write performance. The problem is that migration starts at around 1,500 items/sec (read/write, on average) but keeps slowing down as the number of rows in the CUSTOMER and CUSTOMER_USER tables grows (after a few minutes, the r/w speed was around 500 items/sec). I am not experienced with Hibernate, so here are my questions:

  1. What should hibernate.cfg.xml look like for a multi-threaded task like ours? Does the configuration I gave above fit such a task, or is anything wrong or missing?
  2. There are exactly 50 threads and each does the following: read from DynamoDB first, then insert the results into the MySQL DB, then read from Dynamo again, and so on. Therefore, communication with Hibernate is not active 100% of the time. Under these circumstances, what min_size and max_size would you recommend for the c3p0 connection pool? To understand the concept better, should I also set the remaining c3p0-related properties in hibernate.cfg.xml?
  3. What can be done to maximize the speed of bulk inserting?

NOTE 1 I did not list all of the properties, because apart from the list of users they are all int, boolean, String, etc.

NOTE 2 All of these points have been tested and have no negative effect on performance. When we don't insert anything into the MySQL DB, the read speed stays stable for hours.

NOTE 3 Any recommendation/guidance about the structure of mysql tables, configuration settings, sessions/transactions, number of connection pools, batch sizes, etc. will be really helpful!

Upvotes: 3

Views: 3269

Answers (2)

vanOekel

Reputation: 6538

In your scenario there are 50 threads batch-inserting data into one table simultaneously. MySQL has to maintain ACID properties while 50 transactions, each spanning many records in one table, remain open or are being committed. That can cause huge overhead.

While migrating data between databases, network latency can cause significant delays when there is a lot of back-and-forth communication with the database. In that case, using multiple threads can be beneficial. But when doing batch fetches and batch inserts, there is little to gain, as the database drivers will (or should) transfer the data without much back-and-forth communication.

In the batch scenario, start with 1 thread that reads data, prepares a batch and puts it in a queue for 1 thread that writes data from the prepared batches. Keep the batches small (100 to 1,000 records) and commit often (every 100 records or so). This will minimize the overhead of maintaining the table. If network latency is a problem, try using 2 threads for reading and 2 for writing (but any performance gain might be offset by the overhead of maintaining a table used by 2 threads simultaneously).
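A minimal sketch of that reader/writer pipeline, using a bounded queue between one reader thread and one writer thread. The class and method names are illustrative, and an in-memory loop stands in for the DynamoDB scan and the Hibernate inserts:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class BatchPipeline {
    private static final List<String> POISON = new ArrayList<>(); // end-of-stream marker
    private static final int BATCH_SIZE = 500;

    /** Runs one reader and one writer thread; returns the number of rows "written". */
    public static int runPipeline(int totalItems) throws InterruptedException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(10);
        AtomicInteger written = new AtomicInteger();

        // Reader: stands in for the DynamoDB scan; enqueues fixed-size batches.
        Thread reader = new Thread(() -> {
            try {
                List<String> batch = new ArrayList<>(BATCH_SIZE);
                for (int i = 0; i < totalItems; i++) {
                    batch.add("item-" + i);
                    if (batch.size() == BATCH_SIZE) {
                        queue.put(batch);
                        batch = new ArrayList<>(BATCH_SIZE);
                    }
                }
                if (!batch.isEmpty()) queue.put(batch);
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Writer: each dequeued batch would be one Hibernate insert + commit cycle.
        Thread writer = new Thread(() -> {
            try {
                List<String> batch;
                while ((batch = queue.take()) != POISON) {
                    // session.insert(...) per row, then tx.commit(), would go here
                    written.addAndGet(batch.size());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        writer.start();
        reader.join();
        writer.join();
        return written.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rows written: " + runPipeline(2_000));
    }
}
```

The bounded queue applies back-pressure: if the writer falls behind, the reader blocks instead of piling up batches in memory.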

Since your IDs are assigned by the application rather than generated by the database, you should already benefit from the hibernate.jdbc.batch_size option in your Hibernate configuration (identity generation would disable JDBC insert batching). The hibernate.jdbc.fetch_size option (set it to 250 or so) might also be of interest.
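For reference, the batching-related settings could look like this in hibernate.cfg.xml. The order_inserts property and the rewriteBatchedStatements URL parameter are suggestions of mine, not part of the original configuration:

```xml
<!-- already present: send inserts to MySQL in batches of 50 -->
<property name="hibernate.jdbc.batch_size">50</property>
<!-- suggested: hint for the JDBC fetch size on reads -->
<property name="hibernate.jdbc.fetch_size">250</property>
<!-- suggested: group inserts for the same table together, improving batching -->
<property name="hibernate.order_inserts">true</property>
<!-- suggested, MySQL-specific: let the driver rewrite a batch into one multi-row INSERT -->
<property name="hibernate.connection.url">jdbc:mysql://localhost/xxx?rewriteBatchedStatements=true</property>
```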

As @hemant1900 mentions, using a StatelessSession is also a good idea. But by far the fastest method is the one mentioned by @Rob in the comments: use database tools to export the data to a file and import it into MySQL. I'm quite sure this is also the preferred method: it takes less time, less processing, and there are fewer variables involved - overall a lot more reliable.
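For the export/import route, the MySQL side could look like this (the file name and column layout are illustrative; the CSV would first be produced from the DynamoDB export):

```sql
-- Load a CSV export of the customers in one pass; far faster than row-by-row inserts
LOAD DATA LOCAL INFILE 'customers.csv'
INTO TABLE CUSTOMER
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id);
```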

Upvotes: 0

hemant1900

Reputation: 1226

Assuming you are not doing anything else in the Hibernate transaction than inserting the data into these two tables, you can use StatelessSession session = sessionFactory.openStatelessSession(); instead of a normal session, which avoids the overhead of maintaining the first-level cache. But then you will have to save the nested collection objects separately, since a stateless session does not cascade to collections. Refer to https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html

So it could be something like -

// Each single thread brings resultItems from DynamoDB
StatelessSession session = factory.openStatelessSession();
Transaction tx = session.beginTransaction();
for (int i = 0; i < resultItems.size(); i++) {
    Customer cust = new Customer(resultItems.get(i));
    session.insert(cust); // StatelessSession uses insert(), not save()
    // TODO: Insert the related CUSTOMER_USER rows here, in the same
    // transaction, using the customer's application-assigned id.
}
// No flush()/clear() needed: a StatelessSession keeps no first-level cache.
tx.commit();
session.close();

Upvotes: 2
