Outbox-Polling Coordination: Agents Stop Sending Heartbeats and Die

Hi everyone,

I am using Spring Boot (version 3.3.2) with Hibernate Search (version 7.2.0.Final) and Elasticsearch in my project. My current setup involves the outbox-polling coordination strategy to handle indexing operations asynchronously. While the setup initially works as expected, I am facing a critical issue with the agents.

Issue Description

Here’s the behavior I’ve observed:

  1. When the application starts or an instance restarts (I have 4 application nodes):
    • All agents are created successfully.
    • The hsearch_outbox_event table is processed, and all pending events are cleared.
    • The documents are correctly indexed in Elasticsearch.
  2. After 1-2 minutes of running:
    • One by one, the agents stop sending heartbeats.
    • Eventually, the agents are marked as expired and removed from the Agent table.
    • The last remaining agent has also expired according to its last heartbeat, but it is not removed from the Agent table and stays in the running state.
    • No new agents are created to replace the expired ones, and indexing halts.

Configuration Details

Hibernate Search Settings

spring.jpa.properties.hibernate.search.coordination.strategy=outbox-polling
spring.jpa.properties.hibernate.search.coordination.event_processor.enabled=true
spring.jpa.properties.hibernate.search.coordination.event_processor.polling_interval=300
spring.jpa.properties.hibernate.search.coordination.event_processor.pulse_interval=10000
spring.jpa.properties.hibernate.search.coordination.event_processor.pulse_expiration=120000
spring.jpa.properties.hibernate.search.coordination.event_processor.batch_size=50
spring.jpa.properties.hibernate.search.coordination.event_processor.transaction_timeout=60
spring.jpa.properties.hibernate.search.coordination.event_processor.retry_delay=30

Database Settings

spring.datasource.hikari.maximum-pool-size=50 
spring.datasource.hikari.minimum-idle=25 
spring.datasource.hikari.max-lifetime=1800000 
spring.datasource.hikari.idle-timeout=60000 
spring.datasource.hikari.validation-timeout=5000 
spring.transaction.default-timeout=10 
spring.transaction.rollback-on-commit-failure=true

Questions

  1. What could cause the agents to stop sending heartbeats after running for a short period?
  2. How can I troubleshoot and resolve this issue to ensure agents continue to work correctly over time?

Additional Context

  • Database: PostgreSQL 14.11
  • Hibernate Search Version: 7.2.0.Final
  • Elasticsearch Version: 7.16.3
  • Logs:
    • I have enabled TRACE logging for org.hibernate.search and for org.hibernate.search.elasticsearch.request (via logging.level.*). No critical errors are observed, but I cannot determine why the agents are failing.
  • Relevant Observations:
    • No immediate resource exhaustion (CPU/memory) is evident on the nodes.
    • The database connection pool seems sufficient (hikari.maximum-pool-size=50).

I would greatly appreciate any guidance on debugging or potential misconfigurations. Thank you for your help!

Hey,

thanks for reaching out. It’s hard to tell where the problem is in your case without a reproducer :confused:. Do you see

Scheduling task 'Outbox event processor'.

in your logs after the pulse:

Agent 'smth - Outbox event processor': starting pulse at ...........

(If an outbox event task is scheduled, it will then start by sending a pulse once the task starts executing)

Everything seems to work correctly initially. After the pulse log I can see the expected logs for scheduling, running, processing and completing tasks. However, after some time, the behavior changes abruptly.

I start seeing the following log repeatedly:

Processing Transaction's afterCompletion() phase for org.hibernate.engine.transaction.internal.TransactionImpl@... Executing indexing plan.

After this, no new tasks are scheduled. I only see that afterCompletion() log and the following one:

Persisted 1 outbox events: '[OutboxEvent{id=...}]'

It seems like the agent is no longer functioning properly at this point. Could this be related to transaction timeouts, deadlocks, or something else? Any advice on diagnosing or resolving this would be greatly appreciated.

I found something that might be helpful in diagnosing the issue. My application and agents work correctly in the sandbox environment, but in the production environment, I am experiencing problems.

After some investigation, I noticed a key difference between the two environments:

  • In the sandbox environment, the PostgreSQL idle timeout policy (idle_in_transaction_session_timeout) is set to unlimited.
  • In the production environment, the PostgreSQL idle timeout policy is set to 5 minutes.

To address this, I added the following configuration in the production environment to keep the connections alive:

spring.datasource.hikari.keepalive-time=20000

However, this did not resolve the issue, and the agents still stop functioning after some time.
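
One way to check whether that 5-minute policy is actually what kills the agent sessions is to look, on the PostgreSQL server, for connections sitting idle inside an open transaction (a rough diagnostic sketch, nothing specific to Hibernate Search):

-- Sessions currently idle inside an open transaction, oldest first.
SELECT pid, usename, application_name,
       now() - xact_start AS in_transaction_for,
       left(query, 100) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;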

Any insights or suggestions on why this configuration might not be sufficient or what else I should look into would be greatly appreciated. Thank you!

I simulated this timeout policy in the sandbox environment, and the application still worked correctly there. However, the issue persists in production.

Maybe take a thread dump when you see the application start failing? If a thread is stuck, that should tell us where, at least. Though of course it won’t help if nothing is actually running…
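
For example, with the standard JDK tools (assuming you can run them against the application’s JVM; <pid> is a placeholder for its process id):

jcmd <pid> Thread.print > threaddump.txt
# or, alternatively:
jstack <pid> > threaddump.txt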

Also, if you want useful advice, the logs would help :slight_smile:

After testing all scenarios, I found that the issue occurs when connecting to a PostgreSQL cluster via PgPool-II (port 9999). However, when connecting to a single-node PostgreSQL instance, everything works as expected. Has anyone else encountered this issue or found a solution for it?

My PostgreSQL architecture consists of one primary node for write queries and two standby nodes for read queries. (master-slave architecture)

From my understanding, when a Hibernate Search agent processes the hsearch_outbox_event table, it might involve multiple queries in a single session. For example:

  1. A SELECT query to fetch events from hsearch_outbox_event.
  2. Processing the data (i.e., indexing the affected entities in Elasticsearch).
  3. A DELETE query to remove the processed events from the table.

If my understanding is correct, all of these operations occur within a single session.
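
Roughly, and only as an illustration (this is not the exact SQL Hibernate Search generates; I am just sketching the shape of such a batch), it would look like:

-- Illustration only: not the actual statements issued by Hibernate Search.
BEGIN;

-- 1. Fetch a batch of pending events.
SELECT * FROM hsearch_outbox_event ORDER BY id LIMIT 50;

-- 2. (Application side) reindex the affected entities in Elasticsearch.

-- 3. Delete the events that were just processed.
DELETE FROM hsearch_outbox_event
WHERE id IN (SELECT id FROM hsearch_outbox_event ORDER BY id LIMIT 50);

COMMIT;

-- If the whole session was routed to a hot standby, step 3 is rejected:
-- ERROR:  cannot execute DELETE in a read-only transaction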

Here’s where the issue arises:
When connected through PgPool-II, the session is sometimes passed to a standby node (due to load balancing). If the session starts on a standby node, the DELETE query fails because standby nodes are read-only.

I tested enabling the statement_level_load_balancing flag in PgPool-II, which balances queries at the statement level. While this improved agent stability (agents live longer), the behavior is still unreliable: occasionally, agents still fail. According to the documentation, this setting chooses the load-balancing node per query rather than per session.
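
For reference, the PgPool-II side of this looks roughly like the following pgpool.conf sketch (parameter availability depends on the PgPool-II version, and the agent table name and regex patterns are only illustrative):

# pgpool.conf (sketch)
# Decide the load-balancing node per statement instead of per session.
statement_level_load_balancing = on
# Once a transaction performs a write, stop load-balancing its remaining statements.
disable_load_balance_on_write = 'transaction'
# Route statements touching the Hibernate Search tables to the primary node.
primary_routing_query_pattern_list = '.*hsearch_outbox_event.*;.*hsearch_agent.*'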

It seems that Hibernate Search agents don’t work correctly with PgPool-II and master-slave PostgreSQL architecture.

If the session starts on a standby node, the DELETE query fails because standby nodes are read-only.

That’s interesting. Is that failure silent? If an exception is raised, I’d expect the agent to catch it and give up as a result. Do you have a stack trace?

BTW @mbekhta it seems this code will retry forever: hibernate-search/mapper/orm-outbox-polling/src/main/java/org/hibernate/search/mapper/orm/outboxpolling/event/impl/OutboxPollingEventProcessor.java at 97841fb3a907d88ef738c206385a9b625217b75f · hibernate/hibernate-search · GitHub
We might want to add an upper limit to the number of iterations OR the time spent in that loop, after which we give up, log an error and assume the events will get processed a second time later (which only hurts performance, not consistency)…

It seems that Hibernate Search agents don’t work correctly with PgPool-II and master-slave PostgreSQL architecture.

I confirm Hibernate Search’s outbox-polling agents require write access to the database, if that’s what you mean.

When using PgPool-II, how do you handle this problem in “normal” sessions? E.g. when you start a transaction, run a few select queries, then an insert/update/delete query, how do you make sure the connection you’re using provides write access?

I wasn’t aware of this specific issue. In normal sessions, it’s possible that Hibernate or Spring Boot might handle this scenario, or perhaps PgPool-II has some mechanism to identify the required connection type.

However, I’ve observed a related issue. When the agent is running, the overall TPS (transactions per second) is around 40-50, but when the agent stops, it increases significantly to around 500. This suggests the agent might be silently retrying operations many times in the background; we don’t see any exceptions because the retries are handled internally without raising errors.
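
One way to see whether the same outbox statements really are executed far more often than expected would be pg_stat_statements, assuming that extension is installed (the filter below is just an assumption):

-- Requires the pg_stat_statements extension (PostgreSQL 13+ column names).
SELECT calls, rows,
       round(total_exec_time) AS total_ms,
       left(query, 120) AS query
FROM pg_stat_statements
WHERE query ILIKE '%hsearch_%'
ORDER BY calls DESC
LIMIT 20;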

If the vast majority of connections are not providing write access, I could see the agents opening connections, starting transactions, locking on tables, failing and retrying – thereby causing a severe hit to the overall system performance.
However, I’d at least expect something in the logs to tell you the connections are incorrectly set up.

IMO there are two things here:

  • if connections are not providing write access, we should log the corresponding errors, to make this easier to debug.
  • if the “transaction per second” hit you’re seeing happens even when connections are correctly providing write access, we should investigate this as a performance bug.

That being said, once again, we’d need logs with the TRACE category enabled for org.hibernate in order to be any help at all.
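
For reference, in a Spring Boot setup that would be something along these lines (standard logging.level properties; adjust to whatever logging configuration you use):

# TRACE for everything under org.hibernate (this includes org.hibernate.search)
logging.level.org.hibernate=TRACE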

Of course if you could provide a working reproducer (e.g. one that can easily be set up locally with docker-compose), we could be even more useful and try to figure out a patch.