How to use annotations to read two files and index the connection into multiple fields

I have the following Java code that generates our indexing words:

@Entity
@Indexed(index = "process")
@Table(name = "process")
public class Process extends BaseTemplateBean {

    // [... a lot of unrelated stuff ...]

    @Transient
    private transient IndexingKeyworder indexingKeyworder;

    // [... a bigger lot of unrelated stuff ...]

    @Transient
    @FullTextField(name = "search")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForFreeSearch() {
        return initializeKeywords().getSearch();
    }

    @Transient
    @FullTextField(name = "searchTitle")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForSearchingInTitle() {
        return initializeKeywords().getSearchTitle();
    }

    @Transient
    @FullTextField(name = "searchProject")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForSearchingByProjectName() {
        return initializeKeywords().getSearchProject();
    }

    @Transient
    @FullTextField(name = "searchBatch")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForAssignmentToBatches() {
        return initializeKeywords().getSearchBatch();
    }

    @Transient
    @FullTextField(name = "searchTask")
    @IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
    public String getKeywordsForSearchingForTaskInformation() {
        return initializeKeywords().getSearchTask();
    }

    private IndexingKeyworder initializeKeywords() {
        if (this.indexingKeyworder == null) {
            this.indexingKeyworder = new IndexingKeyworder(this);
        }
        return this.indexingKeyworder;
    }
}

This code uses the following helper class, so that the work is not duplicated in any case:

class IndexingKeyworder {
    private static final String PSEUDOWORD_TASK_AUTOMATIC = "automatic";
    private static final String PSEUDOWORD_TASK_DONE = "closed";
    private static final String PSEUDOWORD_TASK_DONE_PROCESSING_USER = "closeduser";
    private static final String ANY_METADATA_MARKER = "mdWrap";
    private static final char VALUE_SEPARATOR = 'q';
    private static final Pattern TITLE_GROUPS_PATTERN = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");
    private static final Pattern METADATA_PATTERN = Pattern.compile("name=\"([^\"]+)\">([^<]*)<", Pattern.DOTALL);
    private static final Pattern METADATA_SECTIONS_PATTERN = Pattern.compile("<mets:dmdSec.*?o>(.*?)</kitodo:k", Pattern.DOTALL);
    private static final Pattern RULESET_KEY_PATTERN = Pattern.compile("key id=\"([^\"]+)\">(.*?)</key>", Pattern.DOTALL);
    private static final Pattern RULESET_LABEL_PATTERN = Pattern.compile("<label[^>]*>([^<]+)", Pattern.DOTALL);

    // static, so that the cache is shared and accessible from the static getRulesetLabelMap(...)
    private static final Map<String, Map<String, Collection<String>>> rulesetCache = new HashMap<>();

    private Set<String> titleKeywords = Collections.emptySet();
    private Set<String> projectKeywords = Collections.emptySet();
    private Set<String> batchKeywords = Collections.emptySet();
    private Set<String> taskKeywords = Collections.emptySet();
    private Set<String> taskPseudoKeywords = Collections.emptySet();
    private Set<String> metadataKeywords = Collections.emptySet();
    private Set<String> metadataPseudoKeywords = Collections.emptySet();
    private String processId = null;
    private Set<String> commentKeywords = Collections.emptySet();

    public IndexingKeyworder(Process process) {
        this.titleKeywords = filterMinLength(initTitleKeywords(process.getTitle()));
        this.projectKeywords = filterMinLength(initSimpleKeywords(Objects.nonNull(process.getProject()) ? process.getProject().getTitle() : ""));
        this.batchKeywords = filterMinLength(initBatchKeywords(process.getBatches()));
        var taskKeywords = initTaskKeywords(process.getTasksUnmodified());
        this.taskKeywords = filterMinLength(taskKeywords.getLeft());
        this.taskPseudoKeywords = filterMinLength(taskKeywords.getRight());
        var metadataKeywords = initMetadataKeywords(process);
        this.metadataKeywords = filterMinLength(metadataKeywords.getLeft());
        this.metadataPseudoKeywords = filterMinLength(metadataKeywords.getRight());
        this.processId = process.getId().toString();
        this.commentKeywords = filterMinLength(initCommentKeywords(process.getComments()));
    }

    private static Set<String> initTitleKeywords(String processTitle) {
        Set<String> tokens = new HashSet<>();
        Matcher matcher = TITLE_GROUPS_PATTERN.matcher(processTitle);
        while (matcher.find()) {
            String normalized = normalize(matcher.group());
            final int length = normalized.length();
            for (int end = 1; end <= length; end++) {
                tokens.add(normalized.substring(0, end));
            }
            for (int beginning = length - 1; beginning >= 0; beginning--) {
                tokens.add(normalized.substring(beginning, length));
            }
        }
        return tokens;
    }

    private static final Set<String> initSimpleKeywords(String input) {
        Set<String> tokens = new HashSet<>();
        for (String term : splitValues(input)) {
            tokens.add(normalize(term));
        }
        return tokens;
    }

    private static final Set<String> initBatchKeywords(Collection<Batch> batches) {
        if (batches.isEmpty()) {
            return Collections.emptySet();
        }
        Set<String> tokens = new HashSet<>();
        for (Batch batch : batches) {
            String optionalTitle = batch.getTitle();
            if (StringUtils.isNotBlank(optionalTitle)) {
                tokens.addAll(initSimpleKeywords(optionalTitle));
            }
        }
        return tokens;
    }

    private static final Pair<Set<String>, Set<String>> initTaskKeywords(Collection<Task> tasks) {
        Set<String> taskKeywords = new HashSet<>();
        Set<String> taskPseudoKeywords = new HashSet<>();
        for (Task task : tasks) {
            for (String token : splitValues(task.getTitle())) {
                String term = normalize(token);
                taskKeywords.add(term);
                if (task.isTypeAutomatic()) {
                    taskKeywords.add(PSEUDOWORD_TASK_AUTOMATIC + VALUE_SEPARATOR + term);
                }
                TaskStatus taskStatus = task.getProcessingStatus();
                if (Objects.isNull(taskStatus)) {
                    continue;
                }
                if (Objects.equals(taskStatus, TaskStatus.DONE)) {
                    taskPseudoKeywords.add(PSEUDOWORD_TASK_DONE);
                    taskPseudoKeywords.add(PSEUDOWORD_TASK_DONE + VALUE_SEPARATOR + term);
                    User closedUser = task.getProcessingUser();
                    if (Objects.isNull(closedUser)) {
                        continue;
                    }
                    if (StringUtils.isNotBlank(closedUser.getName())) {
                        taskPseudoKeywords.add(PSEUDOWORD_TASK_DONE_PROCESSING_USER + VALUE_SEPARATOR + normalize(
                            closedUser.getName()));
                    }
                    if (StringUtils.isNotBlank(closedUser.getSurname())) {
                        taskPseudoKeywords.add(PSEUDOWORD_TASK_DONE_PROCESSING_USER + VALUE_SEPARATOR + normalize(
                            closedUser.getSurname()));
                    }
                } else {
                    String taskKeyword = taskStatus.toString().toLowerCase();
                    taskPseudoKeywords.add(taskKeyword);
                    taskPseudoKeywords.add(taskKeyword + VALUE_SEPARATOR + term);
                }
            }
        }
        return Pair.of(taskKeywords, taskPseudoKeywords);
    }

    private static final Pair<Set<String>, Set<String>> initMetadataKeywords(Process process) {
        final Pair<Set<String>, Set<String>> emptyResult = Pair.of(Collections.emptySet(), Collections.emptySet());
        try {
            String processId = Integer.toString(process.getId());
            Path path = Paths.get(KitodoConfig.getKitodoDataDirectory(), processId, "meta.xml");
            if (!Files.isReadable(path)) {
                return emptyResult;
            }
            String metaXml = FileUtils.readFileToString(path.toFile(), StandardCharsets.UTF_8);
            if (!metaXml.contains(ANY_METADATA_MARKER)) {
                return emptyResult;
            }
            Set<String> metadataKeywords = new HashSet<>();
            Set<String> metadataPseudoKeywords = new HashSet<>();
            Map<String, Collection<String>> rulesetLabelMap = getRulesetLabelMap(process.getRuleset().getFile());
            Matcher metadataSectionsMatcher = METADATA_SECTIONS_PATTERN.matcher(metaXml);
            while (metadataSectionsMatcher.find()) {
                Matcher keyMatcher = METADATA_PATTERN.matcher(metadataSectionsMatcher.group(1));
                while (keyMatcher.find()) {
                    String key = normalize(keyMatcher.group(1));
                    String valueString = keyMatcher.group(2);
                    for (String singleValue : splitValues(valueString)) {
                        String value = normalize(singleValue);
                        metadataKeywords.add(value);
                        metadataPseudoKeywords.add(key + VALUE_SEPARATOR + value);
                        metadataPseudoKeywords.add(key);
                        for (String label : rulesetLabelMap.getOrDefault(key, Collections.emptyList())) {
                            metadataPseudoKeywords.add(label + VALUE_SEPARATOR + value);
                            metadataPseudoKeywords.add(label);
                        }
                    }
                }
            }
            return Pair.of(metadataKeywords, metadataPseudoKeywords);
        } catch (IOException | RuntimeException e) {
            return emptyResult;
        }
    }

    private static Map<String, Collection<String>> getRulesetLabelMap(String file) {
        Map<String, Collection<String>> rulesetLabelMap = rulesetCache.get(file);
        if (Objects.nonNull(rulesetLabelMap)) {
            return rulesetLabelMap;
        }
        try {
            File rulesetFile = Paths.get(KitodoConfig.getParameter("directory.rulesets"), file).toFile();
            String ruleset = FileUtils.readFileToString(rulesetFile, StandardCharsets.UTF_8);
            rulesetLabelMap = new HashMap<>();
            Matcher keysMatcher = RULESET_KEY_PATTERN.matcher(ruleset);
            while (keysMatcher.find()) {
                String key = normalize(keysMatcher.group(1));
                Matcher labelMatcher = RULESET_LABEL_PATTERN.matcher(keysMatcher.group(2));
                Set<String> labels = new HashSet<>();
                while (labelMatcher.find()) {
                    labels.add(normalize(labelMatcher.group(1)));
                }
                rulesetLabelMap.put(key, labels);
            }
            rulesetCache.put(file, rulesetLabelMap);
            return rulesetLabelMap;
        } catch (IOException | RuntimeException e) {
            return Collections.emptyMap();
        }
    }

    private static final Set<String> initCommentKeywords(List<Comment> comments) {
        Set<String> tokens = new HashSet<>();
        for (Comment comment : comments) {
            String message = comment.getMessage();
            if (StringUtils.isNotBlank(message)) {
                tokens.addAll(initSimpleKeywords(message));
            }
        }
        return tokens;
    }

    private static String normalize(String string) {
        return string.toLowerCase().replaceAll("[\0-/:-`{-¿]", "");
    }

    private static List<String> splitValues(String value) {
        String initializedValue = value != null ? value : "";
        return Arrays.asList(initializedValue.split("[ ,\\-._]+"));
    }

    private static Set<String> filterMinLength(Set<String> tokens) {
        for (Iterator<String> iterator = tokens.iterator(); iterator.hasNext();) {
            if (iterator.next().length() < 3) {
                iterator.remove();
            }
        }
        return tokens;
    }

    public String getSearch() {
        Set<String> freeKeywords = new HashSet<>();
        freeKeywords.addAll(titleKeywords);
        freeKeywords.addAll(projectKeywords);
        freeKeywords.addAll(batchKeywords);
        freeKeywords.addAll(taskKeywords);
        freeKeywords.addAll(metadataKeywords);
        freeKeywords.addAll(metadataPseudoKeywords);
        if (Objects.nonNull(processId)) {
            freeKeywords.add(processId);
        }
        freeKeywords.addAll(commentKeywords);
        return String.join(" ", freeKeywords);
    }

    public String getSearchTitle() {
        return String.join(" ", titleKeywords);
    }

    public String getSearchProject() {
        return String.join(" ", projectKeywords);
    }

    public String getSearchBatch() {
        return String.join(" ", batchKeywords);
    }

    public String getSearchTask() {
        Set<String> allTaskKeywords = new HashSet<>();
        allTaskKeywords.addAll(taskKeywords);
        allTaskKeywords.addAll(taskPseudoKeywords);
        return String.join(" ", allTaskKeywords);
    }
}

I have now been accused of not using the Hibernate Search framework: the functionality needs to be provided via annotations. I am very open to not reinventing existing functionality, so how can I do that?

The following requirements need to be preserved for performance reasons:

  • String metaXml must only be read once!
  • rulesetLabelMap must only be created once per file and be cached! (500k objects can be processed with the same file!)
  • all tokens must be normalized!
  • the special title rules must be followed: the title is cut at characters that are not letters or digits, and the pieces can be searched from the front or from the back (but not both combined), with a minimum of 3 characters
  • for the metadata, search terms must be generated using the rule set (see the Java code above)
  • the various search terms must be included in the joint search (“search”), but not all of them (see the Java code above), and also separately
  • and take into account that none of the calculations are performed twice!
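For illustration, the “create once per file” requirement is essentially memoization. A plain-Java sketch (the actual ruleset parsing is stubbed out here) shows how computeIfAbsent on a shared map gives exactly that guarantee:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class RulesetCacheSketch {
    // shared across all processes, so each ruleset file is parsed only once
    private static final Map<String, String> RULESET_CACHE = new ConcurrentHashMap<>();
    static final AtomicInteger loads = new AtomicInteger();

    static String labelsFor(String file) {
        return RULESET_CACHE.computeIfAbsent(file, f -> {
            loads.incrementAndGet();  // counts actual parses
            return "labels-of-" + f;  // stands in for parsing the ruleset XML
        });
    }

    public static void main(String[] args) {
        for (int i = 0; i < 500; i++) {
            labelsFor("ruleset.xml");
        }
        System.out.println(loads.get()); // prints 1: parsed once despite 500 lookups
    }
}
```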

Please let me know how I can implement this with the annotations offered by Hibernate Search, ideally without any disadvantages!

Hey @matthias-ronge

I think that what you are looking for is Analysis.

From the code you’ve shared, it seems that you are manually transforming your text into keywords/tokens. With full-text search engines (e.g. Lucene/Elasticsearch), that’s usually handled through analysis. The chapter on analysis includes links to the backends and instructions on how you can configure your own analyzers; the backend documentation also lists which filters and tokenizers are available.

I can understand that. My problem is more the annotations themselves. So far I have only known annotations that do something externally: @Deprecated on a function leads to a crossed-out representation in Eclipse and compiler warnings, and the Hibernate annotations replace, as far as I understand, a configuration file that would otherwise have to be created by hand before the program starts. Here, the annotations are active code that is executed during indexing, and I should move my active code into these annotations. I have a few questions about the process, but I want to ask them one at a time:

Question 1. (How) can I index a field into two fields?
At the moment I have (simplified) the following:

@Transient
@FullTextField(name = "searchTitle")
public String getKeywordsForSearchingInTitle() {
    return String.join(" ", titleKeywords);
}

@Transient
@FullTextField(name = "search")
public String getKeywordsForFreeSearch() {
    Set<String> freeKeywords = new HashSet<>();
    freeKeywords.addAll(titleKeywords);
    // ... add other keywords to “free keywords”
    return String.join(" ", freeKeywords);
}

Can I get the title keywords in two search fields by using annotations? What I am looking for is something like

@Transient
@FullTextField(name = Arrays.asList("search", "searchTitle"))
public String getKeywordsForSearchingInTitle() {
    return String.join(" ", titleKeywords);
}

(The above will not work because name is a String. It is just to show what I would like to find.) Or maybe:

@Transient
@FullTextField(name = "search")
@FullTextField(name = "searchTitle")
public String getKeywordsForSearchingInTitle() {
    return String.join(" ", titleKeywords);
}
  • Are annotations repeatable?
  • Does the order of the annotations matter?
  • If I need the same analyzer for both, can I prevent that the work must be done twice?
  • Can several annotations safely index into the same field or may this cause problems that I should be aware of?

Question 2. Which analyzers are there?

In the documentation I only ever see "name" and "english".

My title keywords are formed by splitting the title at characters that are not letters or digits, converting to lowercase, and then taking each part from the front or from the back, but with at least three characters. For the title “Alice Bob&Charly” this would be “ali alic alice lice ice bob cha char charl charly harly arly rly”. The question is divided into sub-questions: Is such functionality already available in the Hibernate Search package, so that I don’t have to program it again? The annotation attribute analyzer is a string; what should I enter here? The links you kindly provided point to individual analyzer APIs from Elasticsearch and Lucene. I don’t want that; I want to do it platform-independently and inside this application itself. Is this some programming language (like Java Expression Language, or BeanShell) where I can do something fancy:

@FullTextField(name = "searchTitle",
    analyzer="var v = #{toLowerCase(split('[\\p{IsLetter}\\p{Digit}]+', #{title}))};" +
             "return leftTruncate(minGlyphs = 3, value = var('v')).addAll(rightTruncate(minGlyphs = 3, value = var('v')))")
private String title;

If so, what language, is there documentation for it, what functions are available?
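For reference, the intended title tokenization can be sketched in plain Java (equivalent in effect to initTitleKeywords plus filterMinLength above, but building only tokens of length ≥ 3 in the first place):

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleKeywordSketch {
    private static final Pattern GROUPS = Pattern.compile("[\\p{IsLetter}\\p{Digit}]+");

    static Set<String> titleKeywords(String title) {
        Set<String> tokens = new LinkedHashSet<>();
        Matcher matcher = GROUPS.matcher(title.toLowerCase());
        while (matcher.find()) {
            String word = matcher.group();
            for (int end = 3; end <= word.length(); end++) {
                tokens.add(word.substring(0, end));     // prefixes, min. 3 chars
            }
            for (int begin = 0; begin <= word.length() - 3; begin++) {
                tokens.add(word.substring(begin));      // suffixes, min. 3 chars
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // 13 tokens: ali alic alice lice ice bob cha char charl charly harly arly rly
        System.out.println(titleKeywords("Alice Bob&Charly"));
    }
}
```

This is what an analyzer would have to reproduce: roughly a tokenizer splitting on non-letters/digits, a lowercase filter, and an edge-n-gram-style filter applied in both directions.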

Question 3. If I need to code my own analyzer, how do I do it?

@FullTextField(name = "searchTitle", analyzer="titleTokenizer")
private String title;

What must the skeleton for it look like? What I am looking for is something like:

public @ titleTokenizer implements analyzer<@FullTextField> {
    @Override
    public String apply(String title) {
        // ...
        return String.join(titleKeywords, " ");
    }
}

Or maybe something as simple as:

@AnalyzerDecl(id = "titleAnalyzer", input = String.class)
public static String formTitleKeywords(String title) {
    // ...
    return String.join(titleKeywords, " ");
}

Does it have to be a class, an annotation, a static function? Does it have to be in a specific package or something else notable?

Question 4. Minimal example for indexing a referenced value?

Very briefly: if I want to index a single value from a linked class into a field of the referencing class, how does that have to look? The other class will not be indexed, and if it were to be indexed in the future, the annotation there should not interfere with the indexing of that other class (if that is possible). I want to be able to search for a Job by job.project.getTitle(). This one should work, I guess:

@Transient
@FullTextField(name = "searchProject", analyzer="titleTokenizer")
public String getProjectTitle() {
    return project.getTitle();
}

But can I do this just with annotations? I see the examples of @Embedded and @IndexedEmbedded, but I don’t understand them. I feel unsafe putting an annotation @FullTextField on class Project: if we later decide to index Project and the field should not be indexed there, or should be indexed differently, that would collide. What I would prefer is something like this, but if it is not possible, that is okay as well. What should I do?

@Entity
@Indexed(index = "job")
@Table(name = "job")
public class Job {
    @ManyToOne
    @JoinColumn(name = "project_id", foreignKey = @ForeignKey(name = "FK_job_project_id"))
    @FullTextField(name = "search" & "searchProject", analyzer = "titleTokenizer",
        @IndexEmbedded = Project::getTitle);
    private Project project;
}

@Entity
@Table(name = "project")
public class Project {
    @Column(name = "title", nullable = false, unique = true)
    private String title;
}

Thank you for any enlightenment, as I am curious to understand and implement this.

Hey. Please see the answers inline:

Do you need to do so? From your example, you want to have a title field and an “all” field in the index, where the “all” field combines text from multiple fields. In most cases, you can instead simply target multiple fields when you perform a search operation: Hibernate Search 7.2.2.Final: Reference Documentation
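For illustration, targeting several index fields in one predicate looks roughly like this (a sketch: searchSession and userInput are assumed to exist, and the field names are the ones from the question):

```java
// One query matching against several index fields at once.
// Assumes a SearchSession obtained via Search.session(entityManager).
List<Process> hits = searchSession.search(Process.class)
        .where(f -> f.match()
                .fields("search", "searchTitle")
                .matching(userInput))
        .fetchHits(20);
```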

  • Are annotations repeatable? Yes, you can have as many as you need; the point of having multiple @***Field annotations is to be able to create index fields with different configurations.
  • Does the order of the annotations matter? No.
  • Can the analysis work be shared when both fields use the same analyzer? No, each index field will apply its analyzer when it is indexed.
  • Can several annotations safely index into the same field? No; if you try mapping two properties to the same index field, you will get an exception at startup.
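Putting that together: mapping one property into two differently configured index fields is exactly what repeating the annotation does. A sketch (assuming an analyzer named "titleAnalyzer" has been defined in the analysis configurer):

```java
@Transient
// one getter, two index fields: @FullTextField is repeatable in Hibernate Search 6+
@FullTextField(name = "search")
@FullTextField(name = "searchTitle", analyzer = "titleAnalyzer")
@IndexingDependency(reindexOnUpdate = ReindexOnUpdate.NO)
public String getKeywordsForSearchingInTitle() {
    return String.join(" ", titleKeywords);
}
```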

Your application has to be using one of the backends, either the Lucene or the Elasticsearch one. The configuration of the analyzers happens within your application, e.g. for an Elasticsearch backend: Hibernate Search 7.2.2.Final: Reference Documentation. Those links to ES/Lucene docs I’ve shared previously will help you see which filters and tokenizers are there, and from there you can work out how to combine them to get the result you are looking for.

One creates a custom analyzer by combining a set of character filters, a tokenizer, and token filters. That’s done through the analysis configurer (depending on the backend: Hibernate Search 7.2.2.Final: Reference Documentation / Hibernate Search 7.2.2.Final: Reference Documentation). If you want to create custom tokenizers/filters, it’s easier with the Lucene backend: see their docs and implement the required interfaces. It is more complex with the Elasticsearch backend, as you would probably need to create custom plugins to get there. But before you start exploring that, you should go through the existing filters, as those most likely already cover your needs.
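As a sketch, assuming the Lucene backend, such a configurer could combine built-in components like this (the analyzer name and the filter parameters here are illustrative, not taken from the question):

```java
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurationContext;
import org.hibernate.search.backend.lucene.analysis.LuceneAnalysisConfigurer;

public class ProcessAnalysisConfigurer implements LuceneAnalysisConfigurer {
    @Override
    public void configure(LuceneAnalysisConfigurationContext context) {
        // "titleAnalyzer": split into letter/digit runs, lowercase,
        // then index prefixes ("edge n-grams") of at least 3 characters
        context.analyzer("titleAnalyzer").custom()
                .tokenizer("standard")
                .tokenFilter("lowercase")
                .tokenFilter("edgeNGram")
                        .param("minGramSize", "3")
                        .param("maxGramSize", "25");
    }
}
```

The configurer is registered through configuration, e.g. hibernate.search.backend.analysis.configurer = class:org.example.ProcessAnalysisConfigurer (package name illustrative). Note that edge n-grams cover the prefix half of the title rule; the suffix half is usually obtained by wrapping the edge n-gram filter between two reverseString filters.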

If Project is not indexed, there is nothing to worry about: just add the annotation on the title there and use @IndexedEmbedded on the project field within the Job class. If you later decide to index Project as well and need a different analyzer applied for that case, you can always add another @FullTextField to create additional index fields.
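A minimal sketch of that shape (the analyzer name is assumed to be defined elsewhere; note that with @IndexedEmbedded the embedded field is addressed as project.searchProject by default):

```java
@Entity
@Indexed(index = "job")
@Table(name = "job")
public class Job {
    @ManyToOne
    @JoinColumn(name = "project_id", foreignKey = @ForeignKey(name = "FK_job_project_id"))
    @IndexedEmbedded // pulls Project's indexed fields into the Job index
    private Project project;
}

@Entity
@Table(name = "project")
public class Project { // not @Indexed: no separate Project index is created
    @Column(name = "title", nullable = false, unique = true)
    @FullTextField(name = "searchProject", analyzer = "titleAnalyzer")
    private String title;
}
```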

And be careful about doing things like the above; see this section of the docs: Hibernate Search 7.2.2.Final: Reference Documentation