Performance issue with bulk inserts

My application has to insert a large number of records in a single transaction. The data is a combination of records imported from an external source and records created within the application. Right now it inserts about 8,500 records spread across several tables in about 15 seconds, which seems like a long time. This is on my local system, with the application and database on the same computer.

The tables use a sequence generator for the primary keys in most cases. The ones that don’t are tables that only receive small numbers of inserts. Does mixing identity columns and sequence generators cause any performance issues?

I have one table that has a field updated by a trigger on insert. I don’t know another way to do this. It adds time because Hibernate has to select the field after insert. This isn’t the only slow part.
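For reference, a simplified sketch of that kind of mapping (hypothetical entity and column names; the @Generated annotation is what makes Hibernate select the value back after the insert):

    import javax.persistence.*;
    import org.hibernate.annotations.Generated;
    import org.hibernate.annotations.GenerationTime;

    @Entity
    public class ImportedRecord { // hypothetical name

        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "imported_record_seq")
        @SequenceGenerator(name = "imported_record_seq", sequenceName = "imported_record_seq")
        private Long id;

        // Populated by an INSERT trigger in the database. @Generated marks the
        // value as produced on execution, so Hibernate issues a SELECT after
        // each insert to read it back.
        @Generated(GenerationTime.INSERT)
        @Column(insertable = false, updatable = false)
        private String calculatedField;
    }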

I tried using the batch_size settings but saw no difference. How can I tell if it is working or if the driver we are using supports it?

This application was created years ago, and we probably made some rookie mistakes. Other important info:

  • SQL Server 2019 - This was a recent update. We did not update the JDBC driver or change the dialect in Hibernate (SQLServer2012Dialect). Not sure if this would make any difference.
  • Hibernate - version 5.0.1; we cannot upgrade any time soon

Am I missing something obvious?

Batching inserts isn’t supported with identity columns, but other than that there’s no specific reason why mixing different generation styles would impact performance. Note, however, that executing inserts for different entity types “flushes” the current batch queue, so if you frequently switch between entity types when inserting, it might reduce the efficiency that batching provides.

Sequence-style generators do support batched inserts; I suggest using an appropriate increment size so that fewer round-trips to the database are needed when retrieving identifier values.
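For example (hypothetical names), the allocationSize on the mapping has to match the increment defined on the database sequence:

    // Hibernate hands out a block of 50 identifiers per round-trip; the
    // sequence must be created to match, e.g. in SQL Server:
    //   CREATE SEQUENCE order_seq START WITH 1 INCREMENT BY 50;
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "order_seq")
    @SequenceGenerator(name = "order_seq", sequenceName = "order_seq", allocationSize = 50)
    private Long id;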

As for the trigger-generated field, Hibernate 6.5 will introduce support for efficiently retrieving arbitrary non-identifier on-execution generated values (i.e. without executing a subsequent select) if your database supports it, which should significantly speed those up as well.

Thank you for the response. One thing that confuses me in the documentation is this statement:

Hibernate disables insert batching at the JDBC level transparently if you use an identity identifier generator.

To me, this sounds like it turns off insert batching completely, for everything. Is that not the case? I ran it through a JUnit test, and every insert is a single SQL statement. It doesn’t look like batch inserts are working for me. These are my settings (pulled from another post):

<prop key="hibernate.jdbc.batch_size">50</prop>
<prop key="hibernate.order_inserts">true</prop>
<prop key="hibernate.order_updates">true</prop>

The data is inserted in 3 chunks. The first is a list of records that have the calculated DB field. (I will keep your suggestion in mind for when we eventually upgrade Hibernate.) Each of these records can have one or more child records. This is done through a persist with cascading inserts. All of these records use sequence generators.

The second chunk is a simple list.

The third chunk is a tree of records. The top level is always a single record, and this is what uses the identity generator. The child records are in a list, and each child can have one or more children. This is done through a saveOrUpdate because the top level and some of the children may already exist. All of the child records use sequence generators.
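Roughly, the mappings for this third chunk look like this (names simplified; only the root uses identity):

    import javax.persistence.*;
    import java.util.ArrayList;
    import java.util.List;

    @Entity
    public class TreeRoot { // hypothetical names throughout

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY) // identity: batching disabled for this type
        private Long id;

        @OneToMany(mappedBy = "root", cascade = CascadeType.ALL)
        private List<TreeChild> children = new ArrayList<>();
    }

    @Entity
    public class TreeChild {

        @Id
        @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "tree_child_seq")
        @SequenceGenerator(name = "tree_child_seq", sequenceName = "tree_child_seq", allocationSize = 50)
        private Long id;

        @ManyToOne
        private TreeRoot root;

        @ManyToOne
        private TreeChild parent; // null for first-level children under the root

        @OneToMany(mappedBy = "parent", cascade = CascadeType.ALL)
        private List<TreeChild> children = new ArrayList<>();
    }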

I can convert everything to use sequence generators, but it is a lot of tables, and I would like to know whether it is really necessary before I do all of that work. This is the slowest area of my application. Data can be inserted in other ways too, which makes performance tuning a little tricky.

To me, this sounds like it turns off insert batching completely, for everything. Is that not the case?

No, identity-generated identifiers, as well as any on-execution generated field in your entity, only disable batching for that specific entity type. You don’t need to change all your mappings to use sequences.

every insert is a single SQL statement

Hibernate logs SQL statements even when adding them to the current batch, so I’m not sure what you mean with that.

As an additional optimization, if you don’t need the inserted entity instances to be stored in the persistence context, using a StatelessSession would definitely help a lot in reducing the overhead of keeping every object in memory, especially when dealing with a high number of rows; note, however, that it doesn’t support cascading. An alternative could be clearing the persistence context periodically, i.e. after every batch_size number of inserts.
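A sketch of the periodic-clear approach (assuming a plain Session and a hypothetical list of entities to insert; the batch size should match your hibernate.jdbc.batch_size setting):

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;
    import java.util.List;

    void bulkInsert(SessionFactory sessionFactory, List<?> records) {
        int batchSize = 50; // keep in sync with hibernate.jdbc.batch_size
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            for (int i = 0; i < records.size(); i++) {
                session.persist(records.get(i));
                if ((i + 1) % batchSize == 0) {
                    session.flush(); // send the queued batch to the database
                    session.clear(); // detach inserted instances so memory stays flat
                }
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }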

Obviously, Hibernate will always have more overhead for inserts than pure SQL; consider, for example, what’s needed to make cascading work. If you wish for us to look into this further, try coming up with a reproducer test case that demonstrates the performance issues and attach it to a new issue in our issue tracker.

Hibernate logs SQL statements even when adding them to the current batch, so I’m not sure what you mean with that.

A post on another site said that the output would look different with batch_size enabled. I may have misunderstood. I was looking for evidence that batching was working to make sure I configured it correctly.

In this case, I don’t need the cascading to work. I could do it myself for these bulk inserts. I didn’t see a way with my current design to disable the cascading that has been defined, and I am not certain that I want to remove it. There may be use cases where it is needed.

I am looking into StatelessSession and flushing the context. These would probably be part of a larger redesign. The application has grown a lot in both volume of data and complexity since we built it. Our methods may be too generic. I may need to break it out to be optimized for different use cases.

A few follow-up notes:

The batch_size parameter is still not changing the time it takes for inserts, and I cannot tell if it is working at all. I also tried the flush-and-clear suggestion for large inserts, but it made the performance much worse. I may need to create a separate small sample program to test it.

I am using an increment size on all the sequences. I may need to tune that.

I am reading about the stateless session. I don’t think it will work in my case. What I really need is a mix. Certain related entities are loaded conditionally. The lazy loading helps with this. Some entities are read-only because they come from external systems. The large inserts that I am doing are dependent on loading this data first to determine where the new data goes. I will keep researching.

One more follow-up. I think I can confirm that batch_size is working. I enabled statistics generation; when I did, I got a weird error message and had to disable my level 2 cache, which doesn’t really work in a JUnit test anyway. Once I did that, I was able to see messages like “1623862900 nanoseconds spent executing 1896 JDBC batches”, and it went down to 0 JDBC batches when I turned off batch_size.

<prop key="hibernate.generate_statistics">true</prop>
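The same counters can also be read programmatically inside the test, which is easier than scanning log output (a sketch; runBulkInsert stands in for my actual code):

    import org.hibernate.stat.Statistics;

    Statistics stats = sessionFactory.getStatistics();
    stats.setStatisticsEnabled(true);
    stats.clear();

    runBulkInsert(); // hypothetical: the code under test

    System.out.println("entities inserted:   " + stats.getEntityInsertCount());
    System.out.println("statements prepared: " + stats.getPrepareStatementCount());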

The timings with batch_size on and off differ only slightly, which is why I couldn’t tell the difference. Depending on what number I set it to, it can even be worse. I need to do more learning. I am going to look into an alternate method for these bulk inserts, such as using a stored procedure. I hope this helps someone else.

JDBC batch insert performance heavily depends on the JDBC driver support, so try updating to the latest version and if that doesn’t help, contact the JDBC driver implementors to see if you can improve performance further.

I saw reports in the past that an ad-hoc statement of the form insert into tbl (..) values (..), (..), i.e. with multiple values lists, is faster than a single JDBC batched prepared statement of the form insert into tbl (..) values (..).

I don’t know the JDBC driver version, or if this is actually true, but either way, I think that if a database is better at processing one style over the other, the JDBC driver should be altered to use the better-performing style. It’s up to you, the users, to request such changes, though, and to support the request you will have to provide a reproducer that actually shows the difference.
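For anyone who wants to measure the two styles directly, a plain-JDBC sketch of both (hypothetical table and row holder, bypassing Hibernate entirely):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    void insertBothStyles(Connection conn, List<Object[]> rows) throws SQLException {
        // Style 1: one parameterized statement executed as a JDBC batch
        try (PreparedStatement ps = conn.prepareStatement(
                "insert into tbl (col1, col2) values (?, ?)")) {
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();
        }

        // Style 2: one ad-hoc statement with multiple values lists
        StringBuilder sql = new StringBuilder("insert into tbl (col1, col2) values ");
        for (int i = 0; i < rows.size(); i++) {
            sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
            int idx = 1;
            for (Object[] row : rows) {
                ps.setObject(idx++, row[0]);
                ps.setObject(idx++, row[1]);
            }
            ps.executeUpdate();
        }
    }

Keep in mind that SQL Server caps a single statement at 2100 parameters, so the multi-values form has to be chunked accordingly.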

Before I ran the tests, I upgraded to the latest version of the SQL Server driver. I do not have comparisons to the original driver that we were using. If there was a difference, it was not noticeable.

		<dependency>
			<groupId>com.microsoft.sqlserver</groupId>
			<artifactId>mssql-jdbc</artifactId>
			<version>12.4.2.jre8</version>
		</dependency>

I will try to find additional information about how the driver and SQL Server work with bulk inserts.

SQL Server also supports passing JSON. I am looking into this, combined with stored procedures, as a possible way to insert or select large amounts of data.
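A sketch of what I have in mind, assuming a hypothetical procedure dbo.InsertRecordsFromJson that shreds the document with OPENJSON (available since SQL Server 2016):

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.SQLException;

    // T-SQL side (hypothetical):
    //   CREATE PROCEDURE dbo.InsertRecordsFromJson @json NVARCHAR(MAX) AS
    //     INSERT INTO tbl (col1, col2)
    //     SELECT col1, col2
    //     FROM OPENJSON(@json) WITH (col1 VARCHAR(100), col2 INT);

    void insertViaJson(Connection conn, String json) throws SQLException {
        try (CallableStatement cs = conn.prepareCall("{call dbo.InsertRecordsFromJson(?)}")) {
            cs.setString(1, json); // json is the serialized list of records
            cs.execute();
        }
    }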