Index creation (Hibernate Search 5.11) (mass indexer) taking a long time

Hi All,

I have around 2.1 million records that I am trying to index. I’ve been experimenting with tuning (i.e. threads and batch size). No matter the settings, after a few hours the speed settles at about 20 documents per second.

I had the threads at 30 (my pool size was 40 min, 100 max), but kept getting connection resets or running out of JDBC connections, so I reduced it to 10 to stop the errors.

public void buildIndex() {
    FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
    try {
        fullTextEntityManager
            .createIndexer( Interaction.class )
            .batchSizeToLoadObjects( 100 )
            .cacheMode( CacheMode.IGNORE )
            .threadsToLoadObjects( 10 )
            .idFetchSize( 300 )
            .transactionTimeout( 691200 ) // in seconds, i.e. 8 days
            //.progressMonitor( monitor ) // a MassIndexerProgressMonitor implementation
            .startAndWait();
    } catch (InterruptedException e) {
        logger.error("Caught Exception: ", e);
        // Restore the interrupt flag so callers can see the interruption
        Thread.currentThread().interrupt();
    }
}

Example speed at start:

2022-02-11 22:55:15.631  INFO 96088 --- [ntifierloader-1] o.h.s.b.i.SimpleIndexingProgressMonitor  : HSEARCH000027: Going to reindex 2135095 entities

2022-02-11 23:02:18.521  INFO 96088 --- [ entityloader-3] o.h.s.b.i.SimpleIndexingProgressMonitor  : HSEARCH000030: 21450 documents indexed in 403867 ms
2022-02-11 23:02:18.522  INFO 96088 --- [ entityloader-3] o.h.s.b.i.SimpleIndexingProgressMonitor  : HSEARCH000031: Indexing speed: 53.111546 documents/second; progress: 1.00%

Then after a few hours:

2022-02-12 09:17:34.491  INFO 96088 --- [ entityloader-5] o.h.s.b.i.SimpleIndexingProgressMonitor  : HSEARCH000030: 851900 documents indexed in 37319836 ms
2022-02-12 09:17:34.491  INFO 96088 --- [ entityloader-5] o.h.s.b.i.SimpleIndexingProgressMonitor  : HSEARCH000031: Indexing speed: 22.827003 documents/second; progress: 39.90%

I did initially test on a smaller data set (same entities etc.) of around 100k records, and I was able to get through that in 10 minutes, at around 100 documents a second.

Am I missing something here? Or can anything be done to speed this up?

Here are the indexed entities (they are pretty big, but only have a few indexed fields). I’ve not included getters / setters.

@Entity
@Indexed
@NormalizerDef(name = "lowercase", filters = {
	@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
	@TokenFilterDef(factory = LowerCaseFilterFactory.class)
}
)
@Table(name = "Interaction")
public class Interaction  {

	private static Logger logger = LoggerFactory.getLogger(Interaction.class);

	@Id
	@Column(name="Id")
	private String id;

	@Field
	@Column(name="Status")
	private Short status;

	@Column(name="EntityTypeId")
	private Short entityTypeId;	

	@Column(name="MediaTypeId")
	private String mediaTypeId;

	@Column(name="TypeId")
	private String typeId;

	@Lob
	@Column(name="AllAttributes")
	private byte[] allAttributes;

	@Column(name="CanBeParent")
	private Boolean canBeParent;

	@Column(name="CategoryId")
	private String categoryId;

	@Column(name="ContactId")
	private String contactId;

	@Column(name="CreatorAppId")
	private Integer creatorAppId;

	@Field
	@Column(name="EndDate")
	@DateBridge(resolution=Resolution.SECOND)
	private Timestamp endDate;

	@Column(name="ExternalId")
	private String externalId;

	@Column(name="IntAttribute1")
	private Integer intAttribute1;

	@Column(name="IntAttribute2")
	private Integer intAttribute2;

	@Column(name="IntAttribute3")
	private Integer intAttribute3;

	@Column(name="IntAttribute4")
	private Integer intAttribute4;

	@Column(name="IntAttribute5")
	private Integer intAttribute5;

	@Column(name="IsCategoryApproved")
	private Boolean isCategoryApproved;

	@Column(name="IsSpam")
	private Boolean isSpam;

	@Column(name="Lang")
	private String lang;

	@Column(name="ModifiedDate")
	@DateBridge(resolution=Resolution.SECOND)
	private Timestamp modifiedDate;

	@Column(name="OwnerId")
	private Integer ownerId;

	@Column(name="ParentId")
	private String parentId;

	@Column(name="QueueName")
	private String queueName;

	@Field
	@SortableField
	@Column(name="StartDate")
	@DateBridge(resolution=Resolution.SECOND)
	private Timestamp startDate;

	@Column(name="StoppedReason")
	private String stoppedReason;

	@Column(name="StrAttribute1")
	private String strAttribute1;

	@Column(name="StrAttribute10")
	private String strAttribute10;

	@Column(name="StrAttribute2")
	private String strAttribute2;

	@Column(name="StrAttribute3")
	private String strAttribute3;

	@Column(name="StrAttribute4")
	private String strAttribute4;

	@Column(name="StrAttribute5")
	private String strAttribute5;

	@Column(name="StrAttribute6")
	private String strAttribute6;

	@Column(name="StrAttribute7")
	private String strAttribute7;

	@Column(name="StrAttribute8")
	private String strAttribute8;

	@Column(name="StrAttribute9")
	private String strAttribute9;

	@Column(name="StructTextMimeType")
	private String structTextMimeType;

	@Column(name="StructuredText")
	private String structuredText;

	@Field
	@Column(name="Subject")
	private String subject;

	@Column(name="SubTenantId")
	private Integer subTenantId;

	@Column(name="SubtypeId")
	private String subtypeId;

	@Column(name="TenantId")
	private Integer tenantId;
	
	@Field
	@Column(name="Text")
	private String text;

	@Field
	@Column(name="TheComment")
	private String theComment;

	@Column(name="ThreadHash")
	private Integer threadHash;

	@Column(name="ThreadId")
	private String threadId;

	@Column(name="Timeshift")
	private Short timeshift;

	@Column(name="WebSafeEmailStatus")
	private String webSafeEmailStatus;

	
	@OneToOne
	@JoinColumn(name = "id", insertable = false, updatable = false)
	@IndexedEmbedded
	@NotFound(action=NotFoundAction.IGNORE) 
    private PhoneCall phoneCall;

	@OneToOne
	@JoinColumn(name = "id", insertable = false, updatable = false)
	@IndexedEmbedded
	@NotFound(action=NotFoundAction.IGNORE) 
    private EmailIn emailIn;
	
	@OneToOne
	@JoinColumn(name = "ownerId", insertable = false, updatable = false)
	@IndexedEmbedded
	@NotFound(action=NotFoundAction.IGNORE) 
	private CfgPerson cfgPerson;

	// getters / setters omitted
}

@Entity
@Table(name = "EmailIn")
public class EmailIn {

	@Id
	@Column(name="Id")
	private String id;

	@Field(normalizer = @Normalizer(definition = "lowercase"))
	@Column(name="FromAddress")
	private String fromAddress;

	@Column(name="FromPersonal")
	private String fromPersonal;

	@Column(name="ReplyToAddress")
	private String replyToAddress;

	@Column(name="ToAddresses")
	private String toAddresses;

	@Column(name="CcAddresses")
	private String ccAddresses;

	@Column(name="BccAddresses")
	private String bccAddresses;
	
	@Column(name="SentDate")
	private Timestamp sentDate;
	
	@Column(name="Mailbox")
	private String mailbox;

	@Column(name="WhichRuleMatched")
	private String whichRuleMatched;

	@Column(name="EmailOutId")
	private String emailOutId;

	@OneToOne(mappedBy = "emailIn")
	@NotFound(action=NotFoundAction.IGNORE) 
	@ContainedIn
	private Interaction interaction;

	// getters / setters omitted
}

@Entity
@Table(name = "PhoneCall")
public class PhoneCall {

	@Id
	@Column(name="Id")
	private String id;
	
	@Column(name="Duration")
	private Integer duration;

	@Column(name="Outcome")
	private String outcome;

	@Field
	@Column(name="Phonenumber")
	private String phoneNumber;

	@Column(name="TConnectionId")
	private String tConnectionId;


	@OneToOne(mappedBy = "phoneCall")
	@NotFound(action=NotFoundAction.IGNORE) 
	@ContainedIn
	private Interaction interaction;

	// getters / setters omitted
}

@Entity
@Table(name = "cfg_person")
public class CfgPerson {

	@Id
	@Column(name="dbid")
	private Integer dbid;

	@Column(name="tenant_dbid")
	private Integer tenantDbid;

	@Field(normalizer = @Normalizer(definition = "lowercase"))
	@Column(name="last_name")
	private String lastName;

	@Field(normalizer = @Normalizer(definition = "lowercase"))
	@Column(name="first_name")
	private String firstName;

	@Column(name="address_line1")
	private String addressLine1;

	@Column(name="address_line2")
	private String addressLine2;

	@Column(name="address_line3")
	private String addressLine3;
	
	@Column(name="address_line4")
	private String addressLine4;
	
	@Column(name="address_line5")
	private String addressLine5;
	
	@Column(name="office")
	private String office;
	
	@Column(name="home")
	private String home;
	
	@Column(name="mobile")
	private String mobile;
	
	@Column(name="pager")
	private String pager;
	
	@Column(name="fax")
	private String fax;
	
	@Column(name="modem")
	private String modem;
	
	@Column(name="phones_comment")
	private String phonesComment;
	
	@Column(name="birthdate")
	private String birthdate;
	
	@Column(name="comment_")
	private String comment;
	
	@Column(name="employee_id")
	private String employeeId;
	
	@Field
	@Field(normalizer = @Normalizer(definition = "lowercase"))
	@Column(name="user_name")
	private String userName;
	
	@Column(name="password")
	private String password;

	@Column(name="is_agent")
	private Integer isAgent;
	
	@Column(name="state")
	private Integer state;
	
	@Column(name="csid")
	private Integer csid;
	
	@Column(name="tenant_csid")
	private Integer tenantCsid;
	
	@Column(name="place_dbid")
	private Integer placeDbid;
	
	@Column(name="place_csid")
	private Integer placeCsid;
	
	@Column(name="capacity_dbid")
	private Integer capacityDbid;
	
	@Column(name="site_dbid")
	private Integer siteDbid;
	
	@Column(name="contract_dbid")
	private Integer contractDbid;

	@Column(name="salted_string")
	private String saltedString;

	@Column(name="ch_pass_on_login")
	private Integer chPassOnLogin;
	
	@Column(name="pass_updating")
	private Integer passUpdating;
	
	@Column(name="pass_hash_alg")
	private Integer passHashAlg;

	@OneToMany(mappedBy = "ownerId")
	@NotFound(action=NotFoundAction.IGNORE) 
	@ContainedIn
	private Set<Interaction> interaction;

	// getters / setters omitted
}

Sorry, one thing I forgot to mention: the smaller dataset was actually in a different DB with the same schema, so maybe that’s the bottleneck.

The fact that the speed goes down over time is normal. It’s because we’re displaying the overall speed, i.e. total_indexed_count/total_time. Indexing is often orders of magnitude faster for the first few entities, which artificially increases the overall indexing speed for a long time, but after some time the number gets back to the actual value (in your case, 20 docs/s, which is indeed slow).

We probably should move to some rolling average of indexing speed, so that this stat is less confusing; I opened HSEARCH-4483 to address this.
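To make the difference between the overall figure and a per-interval figure concrete, here is a small sketch that redoes the arithmetic using the numbers from the two HSEARCH000030 log lines quoted in the question. The class and method names are made up for illustration; this is not how Hibernate Search computes or reports its statistics internally.

```java
import java.util.Locale;

public class IndexingSpeed {

    /** Average speed in documents/second over the whole run so far. */
    public static double overallSpeed(long totalDocs, long totalMillis) {
        return totalDocs * 1000.0 / totalMillis;
    }

    /** Average speed in documents/second between two progress samples. */
    public static double windowSpeed(long docsBefore, long millisBefore,
                                     long docsNow, long millisNow) {
        return (docsNow - docsBefore) * 1000.0 / (millisNow - millisBefore);
    }

    public static void main(String[] args) {
        // Figures copied from the two HSEARCH000030 log lines in the question.
        long docs1 = 21_450, millis1 = 403_867;
        long docs2 = 851_900, millis2 = 37_319_836;

        System.out.printf(Locale.ROOT, "overall at first sample:  %.2f docs/s%n",
                overallSpeed(docs1, millis1));
        System.out.printf(Locale.ROOT, "overall at second sample: %.2f docs/s%n",
                overallSpeed(docs2, millis2));
        System.out.printf(Locale.ROOT, "between the two samples:  %.2f docs/s%n",
                windowSpeed(docs1, millis1, docs2, millis2));
    }
}
```

The two overall values reproduce the logged 53.11 and 22.83 documents/second. The speed between the two samples comes out at roughly 22.5 documents/second, so by the second log line the overall number has almost converged to the real sustained rate.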

Honestly, the most likely bottleneck is loading data from the database, because that’s where most of the resource-intensive work happens.

The very first thing to check is whether you have database indexes on your foreign key columns, so that association loading is as fast as possible. I’m talking about the foreign keys that are behind the associations Interaction#phoneCall, Interaction#emailIn, Interaction#cfgPerson.

Once that’s solved, if indexing is still slow, then enable SQL logging, have a look at the SQL that Hibernate Search uses to load your entities during mass indexing, and check that the queries execute in a reasonable time; if not, try to tune that, like you would for any other slow SQL query.
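As a rough sketch of what "enable SQL logging" can look like, the standard Hibernate ORM settings below print the generated SQL and collect statistics. The helper class is hypothetical, and how the properties reach your persistence unit depends on your setup, which the thread does not show (for example, in a Spring Boot application they would typically be prefixed with spring.jpa.properties.).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SqlLoggingSettings {

    /** Hibernate ORM properties that are useful while diagnosing slow loading. */
    public static Map<String, String> diagnosticProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        // Print every SQL statement Hibernate executes (to standard output).
        props.put("hibernate.show_sql", "true");
        // Pretty-print the statements so joins and WHERE clauses are readable.
        props.put("hibernate.format_sql", "true");
        // Collect session statistics such as statement counts and timings.
        props.put("hibernate.generate_statistics", "true");
        return props;
    }

    public static void main(String[] args) {
        // Print in key=value form, ready to paste into a properties file.
        diagnosticProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

An alternative is to set the org.hibernate.SQL logger category to DEBUG in your logging configuration, which sends the statements through the logger instead of standard output. Either way, remember to turn this off again before a full reindex, since logging millions of statements is itself a cost.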

For more advice, see this section of the reference documentation and this answer on Stack Overflow.

If you run out of ideas to tune data loading from the database, and database loading performance seems reasonable to you, you can always try to tune Lucene indexing performance, in particular merging. But that will be slightly more complex, unless you’re already familiar with Lucene.
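If you do get to that point, the knobs in Hibernate Search 5.x are the indexwriter properties. The sketch below only shows where they go. The helper class is hypothetical and the values are placeholders chosen for illustration, not recommendations; measure before and after changing anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LuceneWriterSettings {

    /** Example index writer settings applied to all indexes ("default" scope). */
    public static Map<String, String> indexWriterProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        // RAM (in MB) buffered before Lucene flushes a new segment.
        // Placeholder value: a larger buffer means fewer, larger flushes.
        props.put("hibernate.search.default.indexwriter.ram_buffer_size", "64");
        // How many segments accumulate before they are merged.
        // Placeholder value: a higher factor means fewer merges during indexing.
        props.put("hibernate.search.default.indexwriter.merge_factor", "20");
        return props;
    }

    public static void main(String[] args) {
        // Print in key=value form, ready to paste into a properties file.
        indexWriterProperties().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Replacing "default" in the property name with an index name limits the setting to that index.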

Thanks for the reply. I just came back to say that it ended up being resource constraints on the DB side. We increased the compute resources (it’s in Azure) and the speed is much better.


Glad it worked out. Thanks a lot for keeping us updated!
