Hello, I’m new to this mailing list so I hope that I’m sending my question to the right place.
We recently ran into issues with the Lucene mass indexing process (Hibernate Search's MassIndexer) on data stored in MariaDB.
The connection is closed after a few seconds and the indexing is only partially completed.
I've got a piece of code that re-indexes a table containing contacts.
This code ran fine until we executed it on a table containing more than 2 million contacts.
With that data volume, the process does not run to completion and stops with the following exception.
11/15 16:12:32 ERROR rg.hibernate.search.exception.impl.LogErrorHandler - HSEARCH000058: HSEARCH000211: An exception occurred while the MassIndexer was fetching the primary identifiers list
org.hibernate.exception.JDBCConnectionException: could not advance using next()
at org.hibernate.exception.internal.SQLExceptionTypeDelegate.convert(SQLExceptionTypeDelegate.java:48)
at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42)
at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:111)
at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:97)
at org.hibernate.internal.ScrollableResultsImpl.convert(ScrollableResultsImpl.java:69)
at org.hibernate.internal.ScrollableResultsImpl.next(ScrollableResultsImpl.java:104)
at org.hibernate.search.batchindexing.impl.IdentifierProducer.loadAllIdentifiers(IdentifierProducer.java:148)
at org.hibernate.search.batchindexing.impl.IdentifierProducer.inTransactionWrapper(IdentifierProducer.java:109)
at org.hibernate.search.batchindexing.impl.IdentifierProducer.run(IdentifierProducer.java:85)
at org.hibernate.search.batchindexing.impl.OptionallyWrapInJTATransaction.runWithErrorHandler(OptionallyWrapInJTATransaction.java:69)
at org.hibernate.search.batchindexing.impl.ErrorHandledRunnable.run(ErrorHandledRunnable.java:32)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.sql.SQLNonTransientConnectionException: (conn=18) Server has closed the connection. If result set contain huge amount of data, Server expects client to read off the result set relatively fast. In this case, please consider increasing net_wait_timeout session variable / processing your result set faster (check Streaming result sets documentation for more information)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.get(ExceptionMapper.java:234)
at org.mariadb.jdbc.internal.util.exceptions.ExceptionMapper.getException(ExceptionMapper.java:165)
at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.handleIoException(SelectResultSet.java:381)
at org.mariadb.jdbc.internal.com.read.resultset.SelectResultSet.next(SelectResultSet.java:650)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:1160)
at org.apache.commons.dbcp2.DelegatingResultSet.next(DelegatingResultSet.java:1160)
at org.hibernate.internal.ScrollableResultsImpl.next(ScrollableResultsImpl.java:99)
... 10 more
Here is the piece of code:
private void index(EntityManager em, LongConsumer callBack) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(em);
    MassIndexer indexer = fullTextEntityManager.createIndexer(Contact.class);
    indexer.batchSizeToLoadObjects(BATCH_SIZE);
    indexer.threadsToLoadObjects(NB_THREADS);
    indexer.progressMonitor(new IndexerProgressMonitor(callBack));
    indexer.start();
}
After several unsuccessful tries, the only thing that makes the process run to completion is to edit the MariaDB config file and set: net_write_timeout = 7200
By default this timeout is 60 seconds; here we raised it to 7200 seconds (2 hours), as the process takes about 45 minutes.
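For reference, and if I understand the MariaDB variable scopes correctly, the same change can also be applied at runtime instead of editing the config file (SET GLOBAL requires the SUPER privilege and only affects new connections):

```sql
-- Equivalent of "net_write_timeout = 7200" in the [mysqld] section:
SET GLOBAL net_write_timeout = 7200;   -- value in seconds; default is 60

-- Or for the current connection only:
SET SESSION net_write_timeout = 7200;
```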
Does anyone have an idea of what we did wrong? It does not seem reasonable to have to set such a huge value just to run a full re-index of the table.
Is there a way to ask the MassIndexer to process the data in limited chunks, or some other way to avoid keeping the connection open for so long?
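In case it helps to answer: the MassIndexer interface seems to declare an idFetchSize(int) method controlling the JDBC fetch size of the primary-key scroll. Would tuning it be the right direction here? Something like the sketch below, which is untested on our side (the value 1000 is only a placeholder, the right value for MariaDB is exactly what we don't know):

```java
private void index(EntityManager em, LongConsumer callBack) {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(em);
    MassIndexer indexer = fullTextEntityManager.createIndexer(Contact.class);
    indexer.batchSizeToLoadObjects(BATCH_SIZE);
    indexer.threadsToLoadObjects(NB_THREADS);
    // Hypothetical tuning: fetch size for the identifier scroll that is
    // timing out in our stack trace; 1000 is a guess, not a tested value.
    indexer.idFetchSize(1000);
    indexer.progressMonitor(new IndexerProgressMonitor(callBack));
    indexer.start();
}
```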
Thanks for your time.