Lucene Index in AEM

A Lucene index supports both property constraints and full-text constraints. Depending on the index definition, it can be used to evaluate property constraints, full-text constraints and path restrictions, and to support sorting.

Lucene Index Definition/Structure - High level:

Mandatory Properties
Name	Type	Value
type	String	lucene
async	String[]	Possible values: async, nrt, fulltext-async
Optional/Supporting Properties
compatVersion	Long	2. By default, Oak uses a Lucene index implementation that does not support property constraints or index-time aggregation. To use these features, set this property to 2.
blobSize	Long	32768 (32 KB, default value). Size of each index file chunk in the repository (index files are split into chunks of this size when stored in the NodeStore).
maxFieldLength	Long	10000 (default value). Maximum number of terms indexed per field.
name	String	Name of the index. It is used in log messages.
indexPath	String	Path of the index definition. If an index definition named customluceneIndex is defined under /oak:index in the repository, then /oak:index/customluceneIndex is the value of this property.
includedPaths	String[]	List of paths to be included in indexing. Only nodes under these paths are indexed.
excludedPaths	String[]	List of paths to be excluded from indexing. Nodes under these paths are not indexed.
queryPaths	String[]	List of paths for which this index is to be used. If the index should be picked only for queries with a specific path predicate, those paths can be provided here.
evaluatePathRestrictions	Boolean	false (default). If set to true, the index evaluates path restrictions, so the path predicate of a query is respected while fetching results from the index. Example: we search for the text "we-retail" under the DAM path /content/dam/we-retail. An index definition without this property returns all results that contain the text "we-retail", and the query engine then filters out the results that are not under /content/dam/we-retail. An index definition with this property set to true returns only the results under that path.
codec	String	Name of the Lucene codec to use. Full-text Lucene indexing uses OakCodec by default, which disables compression, so the index size grows. To enable compression, set this property to Lucene46. Example: the OOB full-text Lucene index available at /oak:index/lucene.
refresh	Boolean	true. Refreshes the stored index definition. On the next async job execution cycle the index definition is refreshed, and this property is removed once the refresh is done.
functionName	String	Name used to enable this index for native queries. For native queries (rep:native) we can specify the index type (supported values are lucene or solr). If multiple Lucene indexes are available and we want a specific one to serve our query, we can set functionName on that index to a meaningful name (a kind of identifier for the index). This name is then used in the native query. Example: //*[rep:native('functionNameValue', 'native search query expression')] The index definition with this functionName will be picked for query execution.
useIfExists	String	Useful in blue-green deployments when using the Composite Node Store (available since Oak 1.10.0). In AEM, version 6.5 ships with Oak 1.10.2.
Properties/Nodes that get created automatically
reindex	Boolean	false
reindexCount	Long	1 after the Lucene index is first created and the first async job has run. The number is incremented by 1 every time a reindex is triggered.
(+) indexRules (Node)	nt:unstructured	This node, with its properties and a few child nodes, is created automatically when we create a Lucene index with the mandatory properties. Significance: it defines the node types and their properties to be indexed as part of this index definition. It can have any number of child nodes defining node types, and each of those can in turn have any number of nodes defining the properties of that node type. Example: the OOB cqPageLucene index has indexRules defined for the node type "cq:Page" and for properties of cq:Page such as jcr:title and cq:lastModified (each of these properties is a child node under the cq:Page node). /oak:index/cqPageLucene/indexRules/cq:Page
Additional child nodes of a Lucene index
(+) aggregates (Node)	nt:unstructured	Defined per primary node type using relative path patterns. It can have any number of node types, and each of those can have any number of include rules (include0, include1, ...) defining relative paths. Significance: it includes the content of descendant nodes in a single node, which makes it easier to search content that is scattered across multiple nodes. If we would like to index the jcr:content (cq:PageContent) of a cq:Page up to a certain depth, we can use the aggregates node. Example: cqPageLucene has aggregates defined for the node type "cq:PageContent", with include0 to include3 (4 nodes) defining paths up to the desired depth, where each include rule represents one more level of hierarchy below cq:PageContent. /oak:index/cqPageLucene/aggregates/cq:PageContent
(+) analyzers (Node)	nt:unstructured	Lets us specify an Analyzer class directly or via composition (defining a Tokenizer plus Filters). Significance: analyzers are used to analyze text both while indexing and while searching during query execution. They convert the given text into smaller units called tokens (with the help of tokenizers and filters) to make searching easier. There are many built-in analyzers that extract keywords from text, convert it to lower case, remove stop words/common words, etc. The most commonly used OOB analyzer is StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer), which filters stop words and punctuation and converts text to lower case. It can also recognize URLs. Usage: full-text search features such as synonyms and stemming support. Will try to create a custom use case illustrating this in upcoming posts.
(+) tika (Node)	nt:unstructured	Oak uses Apache Tika to extract text from binary content. Usage: again in the full-text scenario, for displaying related binary results as part of a search. Example: search for the text "we-retail" to display related images, PDFs or any other related binary content. Will try to create a custom use case illustrating this in upcoming posts.

The table above gives high-level information about the Lucene index definition and the purpose of indexRules, aggregates, analyzers and tika.

Each of these in turn has further configuration (child nodes and their properties) and more detail to it; I will cover these in upcoming posts for better clarity.
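
To make the indexRules and aggregates hierarchy easier to picture, here is a minimal Python sketch that mirrors it as a nested dictionary and lists the resulting node paths. It is only an illustration: the intermediate "properties" node follows the layout described in the Oak Lucene documentation, while the property node name (jcrTitle), its values and the include paths are assumptions, not a dump of the OOB cqPageLucene index.

```python
# Simplified, hypothetical mirror of the child nodes of a Lucene index definition.
# Values are illustrative assumptions, not copied from a real repository.
index_children = {
    "indexRules": {
        "cq:Page": {
            "properties": {
                # One node per indexed property; the node name is arbitrary.
                "jcrTitle": {"name": "jcr:content/jcr:title", "propertyIndex": True},
            }
        }
    },
    "aggregates": {
        "cq:PageContent": {
            # Each include rule goes one level deeper below cq:PageContent.
            "include0": {"path": "*"},
            "include1": {"path": "*/*"},
        }
    },
}


def child_node_paths(node, base):
    """Yield the path of every child node (nested dict) below `base`, depth first."""
    for key, value in node.items():
        if isinstance(value, dict):
            path = f"{base}/{key}"
            yield path
            yield from child_node_paths(value, path)


paths = list(child_node_paths(index_children, "/oak:index/cqPageLucene"))
for p in paths:
    print(p)
```

The printed list includes /oak:index/cqPageLucene/indexRules/cq:Page and /oak:index/cqPageLucene/aggregates/cq:PageContent, the two example paths mentioned in the table.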

As the next step, we will create a custom Lucene property index with the mandatory properties.

Use case: Get all assets that have the "cq:parentPath" property.

(The same use case that we used for creating Property Index in previous blog post)
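
As a rough sketch of what such an index definition could look like for this use case, the Python snippet below assembles it as a dictionary and prints it as JSON. The helper name build_lucene_index_definition and the property node name parentPath are hypothetical. compatVersion and the explicit indexRules entry are assumptions added so that the index can answer a property constraint; in the demo setup only the mandatory properties may have been set by hand.

```python
import json


def build_lucene_index_definition(node_type, property_name, rule_node_name):
    """Return a dict mirroring a Lucene index definition node (illustrative only)."""
    return {
        "jcr:primaryType": "oak:QueryIndexDefinition",
        # Mandatory properties from the table above.
        "type": "lucene",
        "async": ["async"],
        # Assumption: needed so that property constraints are supported.
        "compatVersion": 2,
        "indexRules": {
            "jcr:primaryType": "nt:unstructured",
            node_type: {
                "jcr:primaryType": "nt:unstructured",
                "properties": {
                    "jcr:primaryType": "nt:unstructured",
                    rule_node_name: {
                        "jcr:primaryType": "nt:unstructured",
                        "name": property_name,
                        "propertyIndex": True,
                    },
                },
            },
        },
    }


definition = build_lucene_index_definition(
    "dam:AssetContent", "cq:parentPath", "parentPath"
)
print(json.dumps(definition, indent=2))
```

The resulting JSON is only a description of the node structure; creating the node under /oak:index (for example via CRXDE Lite) is still a separate step.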

path=/content/dam/we-retail
type=dam:AssetContent
1_property=cq:parentPath
1_property.operation=exists
p.limit=-1
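
The predicates above can also be sent to the QueryBuilder JSON servlet at /bin/querybuilder.json. The following is a minimal sketch that only builds the request URL; the host http://localhost:4502 is an assumption (a local author instance), and the helper name querybuilder_url is hypothetical.

```python
from urllib.parse import urlencode

# The QueryBuilder predicates from the use case above.
predicates = {
    "path": "/content/dam/we-retail",
    "type": "dam:AssetContent",
    "1_property": "cq:parentPath",
    "1_property.operation": "exists",
    "p.limit": "-1",
}


def querybuilder_url(host, predicates):
    """Build a URL-encoded QueryBuilder request URL for the given predicates."""
    return f"{host}/bin/querybuilder.json?{urlencode(predicates)}"


# Assumption: a local AEM author instance on the default port.
url = querybuilder_url("http://localhost:4502", predicates)
print(url)
```

Opening the printed URL in a browser session that is logged in to AEM should return the matching nodes as JSON.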

