Apache Tika config in Lucene Index and Query Flow Summary

June 22, 2020

This post covers the Apache Tika config for the Lucene full-text index and summarizes the query and indexing flow that we discussed in the past few posts.

Apache Tika is used to detect and extract text from various file formats. It consists of a Detector and a Parser: the Detector identifies the file format, and the Parser parses the contents of the file.

For the Lucene Index, Oak uses a default Tika config made up of the following:

TypeDetector - org.apache.tika.detect.TypeDetector

This detector uses the content type available in input metadata to arrive at the content type/mimeType

DefaultParser - org.apache.tika.parser.DefaultParser

A composite parser built on all available specific parser implementations,
e.g. PDFParser, MP4Parser and every other parser implementation available in Apache Tika.

Empty Parser - org.apache.tika.parser.EmptyParser

As the name suggests, it is a dummy parser that does not parse anything.
Hence, defining mime types within the Empty Parser is equivalent to excluding them from text extraction.
In the default config, compressed assets and images are all excluded from extraction (their mimeTypes are defined within the Empty Parser).

Default config file is available here

Given the detector and parsers available in the default config, the most common use case for a custom config is to exclude certain mimeTypes from extraction.

(This helps reduce the volume of indexed data in the repository.)
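
To make the exclusion idea concrete, here is a minimal sketch in plain Java. It holds an illustrative config string in the tika-config.xml layout and lists the mimeTypes assigned to the Empty Parser. The mime list is only a small sample (not Oak's full default list), the application/pdf entry is an assumed custom exclusion, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class TikaConfigSketch {

    // Illustrative config only: a sample of mime types, not Oak's complete default list.
    static final String CONFIG =
          "<properties>"
        + "<detectors><detector class=\"org.apache.tika.detect.TypeDetector\"/></detectors>"
        + "<parsers>"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
        + "<parser class=\"org.apache.tika.parser.EmptyParser\">"
        + "<mime>application/zip</mime>"
        + "<mime>image/png</mime>"
        + "<mime>application/pdf</mime>" // assumed custom exclusion for this example
        + "</parser>"
        + "</parsers>"
        + "</properties>";

    // Returns the mime types listed under the EmptyParser, i.e. the types excluded from extraction.
    static List<String> excludedMimeTypes(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> excluded = new ArrayList<>();
            NodeList parsers = doc.getElementsByTagName("parser");
            for (int i = 0; i < parsers.getLength(); i++) {
                Element parser = (Element) parsers.item(i);
                if ("org.apache.tika.parser.EmptyParser".equals(parser.getAttribute("class"))) {
                    NodeList mimes = parser.getElementsByTagName("mime");
                    for (int j = 0; j < mimes.getLength(); j++) {
                        excluded.add(mimes.item(j).getTextContent().trim());
                    }
                }
            }
            return excluded;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the sample config", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(excludedMimeTypes(CONFIG));
    }
}
```

The snippet only reads the XML with the JDK parser; it does not call Tika or Oak. Its purpose is to show the shape of the config and which entries act as exclusions.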

Video Demo:

A PDF is used for this demo (a certification guide PDF uploaded to my local instance under the we-retail DAM path).

Summary on Queries and Indexing :

When we write query-related functionality, we can proceed as follows:

Query Debugger : (http://localhost:4502/libs/cq/search/content/querydebug.html)

Write queries in the Query Debugger and execute them.
Use p.limit=-1 to get the complete result set and check whether we are getting the expected results.
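
As a rough illustration, the predicates typed into the Query Debugger can be assembled as a plain map. The sketch below uses only the JDK; the path, node type and search term are assumptions for the example, and the class name is hypothetical. In an AEM bundle, a map like this is typically what gets handed to the QueryBuilder API (for example via PredicateGroup.create), which is not invoked here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class QueryDebuggerSketch {

    // Predicates for a full-text search on assets; all values are example assumptions.
    static Map<String, String> predicates() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("path", "/content/dam/we-retail");
        p.put("type", "dam:Asset");
        p.put("fulltext", "certification");
        p.put("p.limit", "-1"); // return the complete result set, not just the first page
        return p;
    }

    // Renders the map in the key=value, one-per-line form accepted by the Query Debugger text area.
    static String toDebuggerText(Map<String, String> p) {
        return p.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(toDebuggerText(predicates()));
    }
}
```

Keeping p.limit=-1 to the verification phase is a sensible habit, since returning everything can be expensive on large result sets.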

Explain Query Tool : (Operations -> Diagnosis -> Query Performance)

Once the query is framed, execute it in the Explain Query tool to observe the index used, the execution time, the query plan and the cost calculation.

Decision on Index:

If it is a traversal query and we foresee the content volume growing large, consider creating an index.
To decide on the index type, we again need to think about the long run. In general, per the Adobe docs, a Lucene Index is recommended because it can cover many properties under one index definition, is flexible, and offers more options in the index definition (in the form of supporting properties).
However, if we need accurate results and the query involves a unique constraint, we should consider creating a Property Index.

While creating Index Definition:

After deciding to create an index and choosing the index type, we can start by creating the index definition with its mandatory properties.
By understanding the significance of each optional/supporting property, we can make the index definition specific, which reduces the volume of content indexed to what we need and ultimately leads to faster query execution.
To create a Lucene Index, we can use the Oak Index Definition Generator by pasting in XPath or JCR-SQL2 queries (which we can get from the Query Debugger tool or from CRXDE -> Tools -> Query).
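
For orientation, the sketch below models the rough shape of a Lucene index definition as nested maps and prints it as a node tree ("+" for child nodes, "-" for properties). It is plain Java and does not touch the repository. The included path, the rule node name "title" and the indexed property are assumptions for the example; includedPaths is shown as a single value although it is multi-valued in a real definition, and the jcr:primaryType of the child nodes is omitted for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LuceneIndexDefinitionSketch {

    // Small helper: builds an ordered map from alternating key/value arguments.
    static Map<String, Object> node(Object... kv) {
        Map<String, Object> m = new LinkedHashMap<>();
        for (int i = 0; i < kv.length; i += 2) {
            m.put((String) kv[i], kv[i + 1]);
        }
        return m;
    }

    // Example definition: core properties first, then one optional restriction and one index rule.
    static Map<String, Object> definition() {
        return node(
            "jcr:primaryType", "oak:QueryIndexDefinition",
            "type", "lucene",
            "async", "async",                          // indexing lane
            "compatVersion", 2,
            "includedPaths", "/content/dam/we-retail", // optional, assumed path
            "indexRules", node(
                "dam:Asset", node(
                    "properties", node(
                        "title", node(
                            "name", "jcr:content/metadata/dc:title",
                            "propertyIndex", true)))))
        );
    }

    // Renders child nodes with "+" and properties with "-", indenting two spaces per level.
    @SuppressWarnings("unchecked")
    static String render(Map<String, Object> n, String indent) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Object> e : n.entrySet()) {
            if (e.getValue() instanceof Map) {
                sb.append(indent).append("+ ").append(e.getKey()).append('\n');
                sb.append(render((Map<String, Object>) e.getValue(), indent + "  "));
            } else {
                sb.append(indent).append("- ").append(e.getKey())
                  .append(" = ").append(e.getValue()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(render(definition(), ""));
    }
}
```

The output of the Oak Index Definition Generator should be treated as the source of truth for a real query; this tree is only a reading aid for the structure.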

Reindexing:

When we create a new index definition for the first time, indexing happens along with the persistence (synchronously) in the case of a Property Index, and on the next AsyncIndexUpdate run in the case of a Lucene Index.
Apart from this, any later change to the index definition requires triggering a reindex.
In the case of a Lucene Index, we also have the option of a refresh (by setting the property refresh -> true).

Troubleshooting:

Loggers to enable (to be added in Log Support in the Felix console as needed - http://localhost:4502/system/console/slinglog)

Queries

org.apache.jackrabbit.oak.query
com.day.cq.search (If we are using QueryBuilder API/Query predicate logic)

Indexing

org.apache.jackrabbit.oak.plugins.index

Async Index Job execution related

org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate

MBeans: Available in JMX console (http://localhost:4502/system/console/jmx)

IndexStats (async, async-reindex, fulltext-async) - A separate async job runs for each indexing lane configured in the index definitions (the async property value indicates the indexing lane), hence there is a separate MBean for each lane.

async and fulltext-async are the two possible indexing lanes for a Lucene Index.
The async-reindex lane is for reindexing a Property Index asynchronously.

LuceneIndex (Lucene Index statistics)
PropertyIndexStats (Property Index statistics)

Other Tools related to Queries and Indexing:

Tools -> Operations -> Diagnosis ->

Query Performance in Diagnosis also lists Slow Queries and Popular queries in our instance.
Index Manager in Diagnosis lists the available indexes in our instance (indexes available under /oak:index)

In case of issues, or to access reports -> Tools -> Operations -> Health Reports ->

Asynchronous Indexes
Large Lucene Index

The above is the high-level flow, from a development standpoint, for building query-based functionality. Specific troubleshooting cases and reindexing scenarios are detailed in the Adobe helpx docs.
