Handling Large Data Sets with Hibernate Search and OpenSearch: Seeking Long-Term Solutions

Hello Hibernate Search Community,

I am currently working with Hibernate Search to query large datasets from OpenSearch and have encountered a limitation with the size of the result window. My application needs to process potentially tens of millions of records stored in OpenSearch, but I am hitting the index.max_result_window limit, which is set to 10,000 by default.

Here’s how my current method to fetch data looks:

public Page<EntityDetails> findEntities(SearchRequest searchRequest) {
    var searchResult = entitySearchRepository.search(Entity.class)
            .where(searchRequest.toPredicate())
            .loading(searchLoadingOptionsStep -> entityGraphs.forEach(graph -> searchLoadingOptionsStep.graph(graph, GraphSemantic.FETCH)))
            .sort(searchRequest.toSort())
            .fetch(searchRequest.getOffset(), searchRequest.getLimit());

    var result = searchResult
            .hits()
            .stream()
            .map(mapper::map)
            .toList();
    return new PageImpl<>(result, PageRequest.of(searchRequest.getPage(), searchRequest.getSize()), searchResult.total().hitCount());
}

While attempting to retrieve data beyond the 10,000th record, I receive the following error:

Response: 400 'Bad Request' with body {
  "error": {
    "root_cause": [{
      "type": "illegal_argument_exception",
      "reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [36540]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting."
    }],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [{
      "shard": 0,
      "index": "entity-index-v1",
      "node": "R8TtsgiuQIKy9Tya8KWvmw",
      "reason": {
        "type": "illegal_argument_exception",
        "reason": "Result window is too large, from + size must be less than or equal to: [10000] but was [36540]. See the scroll api for a more efficient way to request large data sets."
      }
    }]
  },
  "status": 400
}

I understand that adjusting the index.max_result_window setting could temporarily resolve this issue. However, as our data continues to grow, this might not be a viable long-term solution, since we could eventually exceed any statically configured limit. I am looking for guidance on implementing a more scalable approach, possibly using the Scroll API in Hibernate Search (7.1.0.Final), to handle large datasets effectively.

Could anyone provide examples or insights on how to adjust my method to utilize the Scroll capabilities for handling very large data sets efficiently?

Thank you in advance for your help!

Hey,

Thanks for reaching out. That max_result_window limit in OpenSearch/Elasticsearch is there on purpose: to produce a single “page”, documents from multiple shards have to be retrieved, re-sorted, and mostly discarded (look up “deep paging in distributed systems” if you want to read more).

The Scroll API is better suited to use cases where a report over all results has to be generated; its downside is that the search context has to be preserved on the server so that you can get a consistent, uninterrupted sequence of documents.
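For the report-style use case, Hibernate Search exposes scrolling directly in its DSL via SearchScroll and SearchScrollResult (from org.hibernate.search.engine.search.query). A minimal sketch, assuming a searchSession is available; the Entity type, the process handler, and the chunk size of 100 are illustrative only:

```java
// Processes all matching entities in fixed-size chunks, without ever
// requesting a deep "from + size" window from OpenSearch.
try ( SearchScroll<Entity> scroll = searchSession.search( Entity.class )
        .where( f -> f.matchAll() )
        .scroll( 100 ) ) { // chunk size: hits fetched per round-trip
    for ( SearchScrollResult<Entity> chunk = scroll.next();
            chunk.hasHits(); chunk = scroll.next() ) {
        chunk.hits().forEach( this::process ); // hypothetical per-entity handler
        // Consider clearing the persistence context between chunks to keep memory flat.
    }
}
```

The try-with-resources block matters: it releases the server-side search context when scrolling finishes.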

So ideally, you’d offer users flexible enough filtering and sorting that paginating through just a few pages of results is sufficient, and reserve scrolling for generating reports or feeding data out of your OpenSearch cluster into some other system.


Hey,

First, let’s clarify that querying a large dataset is not the problem here, since 10,000 documents is not much compared to the amount of data that OpenSearch can index.

The problem is deep pagination within that dataset, i.e. going beyond the 10,000th document within a single query. I think @mbekhta covered pretty much all there was to say about that: you should simply avoid presenting that many results to your users.

That being said, if you need to go that deep into the results, I suspect you’re not using a relevance/score sort, because the 10,001st result is likely not relevant at all.
Instead, you’re probably sorting by field values… and that opens the door to another type of pagination: search_after. The idea is that instead of asking for the 10,001st result directly, you store the “sort keys” (the values of the fields you’re sorting on) of the 10,000th result; to fetch the next page, you simply ask for the documents whose sort values come after those of the 10,000th result. OpenSearch automatically adds a filter to the query, thereby bypassing the result window limit.

Note that, obviously, this solution does not provide random access to a given page: by itself it only allows (efficiently) fetching the next page. Depending on your use case, that might be enough.
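To make the keyset idea concrete, here is a self-contained plain-Java simulation of that logic; no OpenSearch involved, and the Doc record, its fields, and the (timestamp, id) sort order are made up for illustration. In production, OpenSearch performs the equivalent filtering server-side when you pass search_after:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;

public class SearchAfterSketch {
    // A document with the two fields we sort on: (timestamp, id).
    // The id acts as a tiebreaker, as it would in a real search_after sort.
    record Doc(long timestamp, String id) {}

    // Fetch the next page: keep only documents whose sort keys come
    // strictly after the last keys seen, then take the first `size`.
    static List<Doc> nextPage(List<Doc> sorted, Doc lastSeen, int size) {
        Comparator<Doc> order = Comparator.comparingLong(Doc::timestamp)
                .thenComparing(Doc::id);
        return sorted.stream()
                .filter(d -> lastSeen == null || order.compare(d, lastSeen) > 0)
                .limit(size)
                .toList();
    }

    public static void main(String[] args) {
        // Ten documents, two per timestamp, already in (timestamp, id) order.
        List<Doc> index = IntStream.range(0, 10)
                .mapToObj(i -> new Doc(i / 2, "doc-" + i))
                .toList();

        List<Doc> page1 = nextPage(index, null, 3);
        // The last hit's sort keys become the cursor for the next page.
        List<Doc> page2 = nextPage(index, page1.get(page1.size() - 1), 3);

        System.out.println(page1);
        System.out.println(page2);
    }
}
```

Note how the cursor is the last hit itself, not a numeric offset: that is exactly why this approach only supports “next page”, not random access.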

Unfortunately, that type of pagination is not directly exposed in Hibernate Search (yet, see HSEARCH-2601).
But you can probably leverage it yourself using the Elasticsearch extension to the Hibernate Search DSL (warning, you’ll have to manipulate JSON :slight_smile: ).
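A rough sketch of what that could look like with the Elasticsearch extension’s requestTransformer hook, which hands you the request body as a Gson JsonObject. This is an untested illustration: lastSortValues is a hypothetical Gson JsonArray holding the sort keys of the last hit of the previous page, and searchRequest mirrors the helper from the original question:

```java
List<Entity> nextPage = searchSession.search( Entity.class )
        .extension( ElasticsearchExtension.get() )
        .where( searchRequest.toPredicate() )
        .sort( searchRequest.toSort() )
        .requestTransformer( context -> {
            JsonObject body = context.body();
            // Inject the cursor from the previous page's last hit.
            body.add( "search_after", lastSortValues );
            // search_after must not be combined with a "from" offset.
            body.remove( "from" );
        } )
        .fetchHits( searchRequest.getSize() );
```

You’d still need to capture the sort values of the last hit yourself, e.g. by reading the raw response through the same extension.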


Thank you @mbekhta and @yrodiere!

We decided to limit how many items users can paginate through. We will wait for the search_after feature :smile: