Lucene Index in AEM - Part 1

Lucene Index in AEM


A Lucene index supports both property constraints and full-text constraints. Depending on the index definition, it can also be used to evaluate path restrictions and sorting.

Lucene Index Node Structure

Lucene Index Definition/Structure - High level:

Mandatory Properties
Name | Type | Value
type | String | lucene
async | String[] | Possible values: async, nrt, fulltext-async
Optional/Supporting Properties
compatVersion | Long | 2
By default, Oak uses a Lucene index implementation that does not support property constraints or index-time aggregation. Set this property to 2 to enable these features.
blobSize | Long | 32768 (32 KB, default value)
Size of each index file chunk in the repository (used for splitting the index files when storing them in the NodeStore).
maxFieldLength | Long | 10000 (default value)
Maximum number of terms indexed per field.
name | String | Name of the index
Used in log messages.
indexPath | String | Path of the index definition
If the index definition named customluceneIndex is defined under /oak:index in the repository, the value of this property is /oak:index/customluceneIndex.
includedPaths | String[] | List of paths to be included in indexing
Only nodes under these paths are indexed.
excludedPaths | String[] | List of paths to be excluded from indexing
Nodes under these paths are not indexed.
queryPaths | String[] | List of paths for which this index is to be used
The index is picked only for queries whose path predicate falls under one of the paths listed here.
evaluatePathRestrictions | Boolean | false (default)
If set to true, the index evaluates path restrictions itself, so the path predicate of a query is applied while fetching results from the index.
Example: we search for the text "we-retail" under the DAM path /content/dam/we-retail.
Without this property, the index returns every result that contains the text "we-retail", and the query engine then filters out the results that are not under /content/dam/we-retail.
With this property set to true, the index returns only the results under that path.
codec | String | Name of the Lucene codec to use
Full-text Lucene indexing uses OakCodec by default, which disables compression, so the index size grows.
To enable compression, set this property to Lucene46.
Example: the OOB full-text Lucene index available at /oak:index/lucene.
refresh | Boolean | true
Refreshes the stored index definition.
On the next async indexing cycle, the index definition is refreshed and this property is removed.
functionName | String | Name used to enable index usage with native query support
For native queries (rep:native), we can specify the index type (supported values are lucene or solr).
With Lucene, if several Lucene indexes are available and we want a specific one to serve our query, we can set functionName on that index to a meaningful name that acts as an identifier for it.
This name is then used in native queries.
Example:
//*[rep:native('functionNameValue', 'native search query expression')]
The index definition with this functionName is picked for query execution.
useIfExists | String | Useful in blue-green deployments when using the Composite Node Store
(Since Oak version 1.10.0)
AEM 6.5 ships with Oak 1.10.2.
Properties/Nodes that get created automatically
reindex | Boolean | false
reindexCount | Long | 1 the very first time, when the Lucene index is created and the first async job runs
The number is incremented by 1 every time a reindex is triggered.
(+) indexRules (Node) | nt:unstructured | This node, with its properties and a few child nodes, is created automatically when we create a Lucene index with the mandatory properties.
Significance:
Used to define the node types, and the properties of each node type, to be indexed by this index definition.
It can hold any number of child nodes defining node types, and each of these can in turn hold any number of nodes defining that node type's properties.
Example: the OOB cqPageLucene index has indexRules defined for the node type "cq:Page" and its properties jcr:title, cq:lastModified etc. (each of these properties is defined as a child node under cq:Page/properties).
/oak:index/cqPageLucene/indexRules/cq:Page
Other additional child nodes as part of the Lucene index
(+) aggregates (Node) | nt:unstructured | Defined based on primary node type and relative path patterns
It can hold any number of node types, and each of these can hold include(n) rules defining relative paths.
Significance:
Aggregates fold the content of descendant nodes into a single node, which makes it easier to search content that is scattered across multiple nodes.
If we would like to index the jcr:content (cq:PageContent) of a cq:Page up to a certain depth, we can use the aggregates node.
Example: cqPageLucene has aggregates defined for the node type "cq:PageContent", with include0 to include3 (4 nodes) defining paths up to the desired depth, where each include rule represents one more level of hierarchy below cq:PageContent.
/oak:index/cqPageLucene/aggregates/cq:PageContent
(+) analyzers (Node) | nt:unstructured | Option to specify the Analyzer class directly or via composition (defining Tokenizer + Filters)
Significance:
Analyzers are used to analyze text both while indexing and while searching during query execution.
An analyzer converts the given text into smaller units called tokens (with the help of Tokenizers and Filters) to make searching easier.
There are many built-in analyzers, which extract keywords from text, convert it to lower case, remove stop words/common words etc.
The most commonly used OOB analyzer is StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer), which filters stop words and punctuation and converts text to lower case. It can also recognize URLs.
Usage: full-text search scenarios that need features like synonyms and stemming support.
Will try to create a custom use case illustrating this in upcoming posts.
(+) tika (Node) | nt:unstructured | Oak uses Apache Tika to extract text from binary content.
Usage: again in full-text scenarios, for displaying related binary results as part of search.
Example: search for the text "we-retail" to display related images, PDFs or any other related binary content.
Will try to create a custom use case illustrating this in upcoming posts.
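
To make the evaluatePathRestrictions row more concrete, here is a small toy simulation in Python. It is not Oak code: the INDEXED content, the paths and the helper names are all made up for illustration. It only mimics the idea that the query engine always applies the path filter, while the index applies it only when evaluatePathRestrictions is true.

```python
# Toy model only: hypothetical indexed content (path -> extracted text).
INDEXED = {
    "/content/dam/we-retail/en/bike.jpg": "we-retail mountain bike",
    "/content/dam/other/brochure.pdf": "brochure mentioning we-retail",
    "/content/we-retail/us/en": "we-retail home page",
}


def is_under(path, root):
    """True if path is root itself or a descendant of root."""
    return path == root or path.startswith(root.rstrip("/") + "/")


def index_lookup(term, root, evaluate_path_restrictions):
    """Paths handed back by the (simulated) index for a text term."""
    hits = sorted(p for p, text in INDEXED.items() if term in text)
    if evaluate_path_restrictions:
        # The index itself honours the path predicate.
        hits = [p for p in hits if is_under(p, root)]
    return hits


def run_query(term, root, evaluate_path_restrictions):
    """Return (number of entries read from the index, final results)."""
    from_index = index_lookup(term, root, evaluate_path_restrictions)
    # The query engine filters by path in both cases.
    results = [p for p in from_index if is_under(p, root)]
    return len(from_index), results


root = "/content/dam/we-retail"
print(run_query("we-retail", root, False))
print(run_query("we-retail", root, True))
```

In this toy model both calls return the same final result, but the first one reads three entries from the index and the second only one, which is the saving the property is meant to give.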

The table above gives a high-level view of the Lucene index definition and of the purpose of indexRules, aggregates, analyzers and tika.
Each of these has further configuration (child nodes and their properties) with more details to it, which I will cover in upcoming posts for better clarity.
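
As a rough picture of how the rows of the table fit together, the sketch below builds an index definition as a nested Python dict and prints it as JSON. It is only a way to visualise the node structure, not a way to deploy an index. The rule node name jcrTitle, the included path and the function name are hypothetical; type, async, compatVersion, evaluatePathRestrictions, includedPaths and indexRules come from the table, and name/propertyIndex are the usual property-definition settings in Oak.

```python
import json


def build_lucene_index_definition(included_paths=None):
    """Return a nested dict mirroring a Lucene index definition node."""
    definition = {
        "jcr:primaryType": "oak:QueryIndexDefinition",
        # Mandatory properties
        "type": "lucene",
        "async": ["async"],
        # Optional properties
        "compatVersion": 2,
        "evaluatePathRestrictions": True,
        # Node types and properties to index
        "indexRules": {
            "jcr:primaryType": "nt:unstructured",
            "cq:Page": {
                "properties": {
                    # "jcrTitle" is a hypothetical node name for this rule.
                    "jcrTitle": {
                        "name": "jcr:content/jcr:title",
                        "propertyIndex": True,
                    }
                }
            },
        },
    }
    if included_paths:
        definition["includedPaths"] = list(included_paths)
    return definition


# "/content/we-retail" is an assumed path, used only as an example.
definition = build_lucene_index_definition(["/content/we-retail"])
print(json.dumps(definition, indent=2))
```

The automatically created properties (reindex, reindexCount) are left out on purpose, since Oak manages them.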

As a next step, we will create a custom Lucene property index with the mandatory properties.
Use case: get all assets which have the "cq:parentPath" property.
(The same use case that we used for creating the Property Index in the previous blog post.)
  • path=/content/dam/we-retail
  • type=dam:AssetContent
  • 1_property=cq:parentPath
  • 1_property.operation=exists
  • p.limit=-1
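
The predicates above can also be turned into a Query Builder request URL. The sketch below is a minimal Python example that only assembles the URL; the host and port (localhost:4502) are an assumption for a local author instance, and /bin/querybuilder.json is the Query Builder JSON servlet.

```python
from urllib.parse import urlencode

# Predicates exactly as listed in the use case.
predicates = {
    "path": "/content/dam/we-retail",
    "type": "dam:AssetContent",
    "1_property": "cq:parentPath",
    "1_property.operation": "exists",
    "p.limit": "-1",
}

# Keep "/" and ":" unescaped so the query string stays readable.
query_string = urlencode(predicates, safe="/:")

# Assumed local author instance; adjust host and port as needed.
url = "http://localhost:4502/bin/querybuilder.json?" + query_string
print(url)
```

Opening the printed URL in a browser where you are logged in to AEM should return the matching nodes as JSON.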
