I am using Spring Boot (version 3.3.2) with Hibernate Search (version 7.2.0.Final) and Elasticsearch in my project. My current setup involves the outbox-polling coordination strategy to handle indexing operations asynchronously. While the setup initially works as expected, I am facing a critical issue with the agents.
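For reference, the coordination strategy is wired up through the usual Hibernate Search properties, along these lines (a minimal sketch using Spring Boot property naming, not my exact production values):

# application.properties (sketch)
spring.jpa.properties.hibernate.search.coordination.strategy=outbox-polling
spring.jpa.properties.hibernate.search.backend.hosts=localhost:9200
spring.jpa.properties.hibernate.search.backend.protocol=http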
Issue Description
Here’s the behavior I’ve observed:
When the application starts or an instance restarts (I have 4 application nodes):
All agents are created successfully.
The hsearch_outbox_event table is processed, and all pending events are cleared.
The documents are correctly indexed in Elasticsearch.
After 1-2 minutes of running:
One by one, the agents stop sending heartbeats.
Eventually, the agents are marked as expired and removed from the Agent table.
The last remaining agent is also expired based on its last heartbeat, but it is not removed from the agent table and stays in the running state.
No new agents are created to replace the expired ones, and indexing halts.
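For what it's worth, I have been watching the agent table with a query along these lines (table and column names here are the Hibernate Search defaults; adjust if the mapping is customized):

-- An agent whose expiration timestamp is in the past is considered expired.
SELECT id, name, state, expiration
FROM hsearch_agent
ORDER BY expiration;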
What could cause the agents to stop sending heartbeats after running for a short period?
How can I troubleshoot and resolve this issue to ensure agents continue to work correctly over time?
Additional Context
Database: PostgreSQL 14.11
Hibernate Search Version: 7.2.0.Final
Elasticsearch Version: 7.16.3
Logs:
I have enabled TRACE logging for org.hibernate.search, including the Elasticsearch request log (org.hibernate.search.elasticsearch.request). No critical errors are observed, but I cannot determine why the agents are failing.
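Concretely, the logging configuration looks like this:

logging.level.org.hibernate.search=TRACE
logging.level.org.hibernate.search.elasticsearch.request=TRACE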
Relevant Observations:
No immediate resource exhaustion (CPU/memory) is evident on the nodes.
The database connection pool seems sufficient (hikari.maximum-pool-size=50).
I would greatly appreciate any guidance on debugging or potential misconfigurations. Thank you for your help!
Everything seems to work correctly at first: after each pulse log, I see the expected logs for scheduling, running, processing, and completing tasks. However, after some time, the behavior changes abruptly.
I start seeing the following log repeatedly:
Processing Transaction's afterCompletion() phase for org.hibernate.engine.transaction.internal.TransactionImpl@... Executing indexing plan.
After this, no new tasks are scheduled; I only see this log and the afterCompletion() log.
It seems like the agent is no longer functioning properly at this point. Could this be related to transaction timeouts, deadlocks, or something else? Any advice on diagnosing or resolving this would be greatly appreciated.
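On the database side, I am checking for stuck sessions while this is happening, using standard PostgreSQL catalog queries such as:

-- Sessions idle inside an open transaction (candidates to be killed by
-- idle_in_transaction_session_timeout) and sessions waiting on locks.
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       left(query, 100) AS query
FROM pg_stat_activity
WHERE state LIKE 'idle in transaction%'
   OR wait_event_type = 'Lock'
ORDER BY xact_age DESC NULLS LAST;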
I found something that might be helpful in diagnosing the issue. My application and agents work correctly in the sandbox environment, but in the production environment, I am experiencing problems.
After some investigation, I noticed a key difference between the two environments:
In the sandbox environment, PostgreSQL's idle_in_transaction_session_timeout is disabled (unlimited).
In the production environment, it is set to 5 minutes.
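Both values are easy to confirm from a psql session in each environment:

-- 0 means the timeout is disabled (unlimited)
SHOW idle_in_transaction_session_timeout;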
To address this, I added the following configuration in the production environment to keep the connections alive:
spring.datasource.hikari.keepalive-time=20000
However, this did not resolve the issue, and the agents still stop functioning after some time. In hindsight that may be expected: as far as I understand, Hikari's keepalive-time pings connections sitting idle in the pool, while idle_in_transaction_session_timeout terminates sessions that are idle inside an open transaction, which a pool-level ping does not help with.
Any insights or suggestions on why this configuration might not be sufficient or what else I should look into would be greatly appreciated. Thank you!
Maybe take a thread dump (e.g. with jstack <pid>) when you see the application start failing? If a thread is stuck, that should tell us where, at least. Though of course it won’t help if nothing is actually running…
Also, if you want useful advice, posting the actual logs would help.
After testing all scenarios, I found that the issue occurs when connecting to the PostgreSQL cluster via PgPool-II (port 9999). When connecting directly to a single-node PostgreSQL instance, everything works as expected. Has anyone else encountered this issue or found a solution for it?
My PostgreSQL architecture consists of one primary node for write queries and two standby nodes for read queries (a primary/standby, i.e. master-slave, setup).
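A quick way to see which node PgPool-II actually routes a query to is to run this standard PostgreSQL function through the PgPool port (9999):

-- Returns false on the primary, true on a standby; running it repeatedly
-- shows which backend the load balancer picked for each session/statement.
SELECT pg_is_in_recovery();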
From my understanding, when a Hibernate Search agent processes the hsearch_outbox_event table, it might involve multiple queries in a single session. For example:
A SELECT query to fetch events from hsearch_outbox_event.
Processing the data (e.g., persisting it in Elasticsearch).
A DELETE query to remove the processed events from the table.
If my understanding is correct, all of these operations occur within a single session (and, I assume, a single transaction), roughly as sketched below.
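So one processing cycle, as I understand it, has roughly this shape (illustrative SQL only; the column names, batching, and predicates are my guesses, not Hibernate Search's exact statements):

BEGIN;

-- 1. Fetch a batch of pending events.
SELECT * FROM hsearch_outbox_event LIMIT 50;

-- 2. (Application side) index the affected entities in Elasticsearch.

-- 3. Delete the events that were just processed (placeholder predicate).
DELETE FROM hsearch_outbox_event WHERE id = '...';

COMMIT;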
Here’s where the issue arises:
When connected through PgPool-II, the session is sometimes routed to a standby node due to load balancing. If the session starts on a standby node, the DELETE query fails, because standby nodes are read-only (PostgreSQL rejects the statement with "cannot execute DELETE in a read-only transaction").
I tested enabling the statement_level_load_balancing flag in PgPool-II, which, according to the documentation, decides the load-balancing node per query rather than per session. While this improved agent stability (agents live longer), the behavior is still unreliable: occasionally, agents still fail.
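For reference, the relevant PgPool-II settings I have been experimenting with (parameter names as in the PgPool-II 4.x documentation; the primary_routing_query_pattern_list line is an idea I am still evaluating, not a confirmed fix):

# pgpool.conf (sketch)
load_balance_mode = on
# Decide the load-balancing node per statement instead of per session:
statement_level_load_balancing = on
# Once a transaction has written, stop load-balancing the rest of it:
disable_load_balance_on_write = 'transaction'
# Route anything touching the Hibernate Search tables to the primary:
primary_routing_query_pattern_list = '.*hsearch_outbox_event.*;.*hsearch_agent.*'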
It seems that Hibernate Search's outbox-polling agents do not work reliably behind PgPool-II load balancing in a primary/standby (master-slave) PostgreSQL architecture.