Indexing seems to be doing too much, is there anything I can do to optimize it?

I have added Hibernate Search to our web application. It is supposed to index Processes, it works but indexing is very slow. At the same time it creates high load. In processes: Application (Tomcat) 50%, MySql 35%, and in services: Mariadb 30%, Windows Search 10% (according to Windows Resource Monitor). Indexing 10k objects took a whole night. I have the feeling that the framework is doing far more than it would need to.

My bean class Process to be indexed has a field id (the ID column of the database) and transient getter methods for the index words. Simplified, it is this:

@MappedSuperclass
public abstract class BaseBean implements Serializable {

    @Id
    @Column(name = "id")
    @GenericField
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    protected Integer id;
}
@MappedSuperclass
public abstract class BaseTemplateBean extends BaseBean {
    // ... 3 fields, not to be indexed ...
}
@Entity
@Indexed(index = "process")
@Table(name = "process")
public class Process extends BaseTemplateBean {

    // ... 30 fields, with complex dependencies, but nothing to be indexed ...

    @Transient
    @FullTextField(name = "search")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForFreeSearch() {
        return kewordsFromFileFor(id); // returns space-separated list of
                                       // indexing terms like
                                       // "dog house treatment bell ...
    }
}

You can see the whole class here: kitodo-production/Kitodo-DataManagement/src/main/java/org/kitodo/data/database/beans/Process.java at a21e0f3f2966d5a36ac4e5a15fa7df759f15e812 · kitodo/kitodo-production · GitHub

Then I search for the ID like this (it works fine):

    public Collection<Integer> searchIds(Class<? extends BaseBean> beanClass,
            String searchField, String value) {

        SearchSession searchSession = Search.session(getSession());
        SearchProjection<Integer> idField = searchSession.scope(beanClass)
                .projection().field("id", Integer.class).toProjection();
        List<Integer> ids = searchSession.search(beanClass).select(idField)
            .where(function -> function.match().field(searchField)
                .matching(value)).fetchAll().hits();
        return ids;
    }

My assumption would be that the content from Process will be indexed to a document with an internal indexing number, and the indexed document only contains the ID.

dog → 1234, house → 1234, treatment → 1234, bell → 1234, …

1234 → { id: 42 }

And that should be much faster and much less load, especially on the database virtually no load. I think the framework does a lot in the background, but I don’t know what or why. Is it related to those annotations? Does it load the entire object with all the lazy-loading collections and objects on it to check if there is an annotation, just to notice: oh, there isn’t one, I can forget about this. Can I improve something here? Can I / do I have to tell the framework: You don’t have to look here, there’s nothing to do? Or is it trying some extended evaluation on the @FullTextField aside from splitting it at spaces? Do I have to disable something?

Any suggestions will be appreciated.

Hey,

Thanks for reaching out. In the snippets you’ve shared I can’t see how do you index the entities initially… Do you use massindexer or other means to populate the index?

The search query you’ve shared uses projection, so no entity loading happens when you execute it i.e. the database is not accessed at this point (or at least it shouldn’t be).

No, the document schema is created on the start and Hibernate Search knows what the document is suppose to look like and which fields are required to create that document, so it does not go through entity graph each time to see whether there’s some annotation somewhere.
Now, if you are using transient getters that access the entity fields and some lazy associations then the calls to those getters would load data as you’d expect.

I’m not entirely sure what do you mean by this… Hibernate Search would take your text data and pass it to the search backend (Lucene/Elasticsearch) which will run analysis of that data and create the index document from it. So Hibernate Search does not perform any analysis, it delegates that to the backend.

This looks a bit concerning… do you actually read keywords from files each time the getter is called? If so, that might be your culprit.

Another thing I’d suggest is trying to enable query logging and running the indexing (and no other processes) to see what queries are executed. Those, in general, should be as if you’d be loading a list of entities by their IDs and access fields you marked for indexing.

Yes, I start it using Mass Indexer.

    public CompletionStage<?> startIndexing(Class<? extends BaseBean> type, MassIndexingMonitor monitor) {
        MassIndexer massIndexer = Search.session(HibernateUtil.getSession()).massIndexer(type);
        massIndexer.dropAndCreateSchemaOnStart(true);
        if (Objects.nonNull(monitor)) {
            massIndexer.monitor(monitor);
        }
        return massIndexer.start();
    }

Thank you for your response. Then I must search elsewhere. Maybe it’s something else, or my expectations just are too far off from reality.

Or is it trying some extended evaluation on the @FullTextField aside from splitting it at spaces?

I’m not entirely sure what do you mean by this…

I did read that @FullTextField is only split at spaces, and no other processing, no linguistic guessing etc. takes place. (And I assume Hibernate Search instructs the chosen search back-end to do just this and nothing more.) The deeper question was if I might have to add another annotation to tell the system not to try to guess or process language here (because the input isn’t).

BTW the reading of the file is cached and only takes ms. My main worry is that why the load on the Tomcat, Mysql and MariaDB are so high which surprises me. Let’s see if I find something out, or if I was just coming with wrong expectations …

Now I removed everything of the keywords indexing. Litreally, my beans are for indexing like

@GenericField
Integer id;

No more other indexed fields.

It doesn’t bring any improvement, still the same slow.

I’ve now started using Wireshark to see what’s going on, and the good news is that nothing is being sent to the index that shouldn’t be.

Now I am trying to enable logging on MariaDB database server. Posts say I must set general_log in my ini or cnf file. I tried but cannot set it in HeidiSQL, because “root” is not allowed to. So, I found 4 files named my ini:

C:>dir /s /b my.ini
C:\ProgramData\MySQL\MySQL Server 5.7\my.ini
C:\ProgramData\MySQL\MySQL Server 8.0\my.ini
C:\Users\All Users\MySQL\MySQL Server 5.7\my.ini
C:\Users\All Users\MySQL\MySQL Server 8.0\my.ini

I changed my inifile in C:\ProgramData\MySQL\MySQL Server 8.0, there was a line general-log=0 (with dash) to 1, and added a line general_log=1 with underscore, and after save I got the message that I must start the editor again in system mode, and I edited the file another time and was allowed to save it. Then I went to services, restarted Maria DB. No file as the name in general_log_file="MIK-NB-10691.log" is created:

C:\>dir /S /B MIK-NB-10691.log
Datei nicht gefunden

There is neither my ini nor my cnf in C:\Program Files\MariaDB 10.10. Nor any other *.ini or *.cnf file.

dir found a folder C:\Users\All Users\MySQL\ but there is no folder All Users in Users. (Showing hidden files is enabled.):

However, I can open C:\Users\All Users\MySQL\my.ini in the editor, and my changes to C:\ProgramData\MySQL\MySQL Server 8.0\my.ini have been added there. How ever that magic does happen.

But even after running the indexing again, there is still no logfile:

C:\>dir /S /B MIK-NB-10691.log
Datei nicht gefunden

It’s a pity that logging for MariaDB is so difficult to enable on Windows. It would have been helpful to me.

I’ve now tried to look at it in Wireshark, but there’s so much and the text is hard to read in the hex editor.

I will try to understand it further but, can anyone tell me how to turn on the log in Windows so that I can continue here?

I’ve now found this works:

SET GLOBAL general_log_file='D:\\usr\\local\\mariadb.log';
SET GLOBAL log_output = 'FILE';
SET GLOBAL general_log = 'ON';
SHOW VARIABLES LIKE "general_log%";

But the log is huge. I must look into it more in detail.

I think I now understand what the problem is. Hibernate automatically loads a lot of referenced objects for each object to be indexed, in a single SQL query with a bunch of joins. I’ve read why this is the case. The dependent objects are quite complex in our case, creating load both on Tomcat and the database. For each of the 100k’s of objects to index the same hand-full of complex related objects is loaded each time again.

What I would need here is for Hibernate to first load just the simple object, just with the foreign key ID of the related object, then look in memory to see if it has already loaded the referenced object (and its referenced objects, and their referenced objects, etc.). I assume at least, hibernate holds a map of weak references for all loaded object IDs in a session. If it finds the object has already been loaded, it sets this object, and only if it doesn’t have the object yet it loads it. Then, of course, this requires its own SQL query.

Is there any way to do this?

Edit: This are ManyToOne mappings