Lucene Index in AEM - Part 1
Lucene Index Definition/Structure - High level:
Mandatory Properties | ||
Name | Type | Value |
type | String | lucene |
async | String[] | Possible values - async, nrt, fultext-async |
Optional/Supporting Properties | ||
compatVersion | Long | 2 Oak uses Lucene index implementation that does not support property constraints, index time aggregation by default. In order to use these features, set this property with value 2 |
blobSize | Long | 32768 (32kb - Default Value) Size of each index file in repository. (for splitting while storing in NodeStore) |
maxFieldLength | Long | 10000 (Default value) Numbers of terms indexed per field |
name | String | name of the index This will be used while logging |
indexPath | String | Path of the index defintion If the index definition named customluceneIndex is defined under /oak:index in the repo, then /oak:index/customluceneIndex is the value for this property. |
includedPaths | String[] | List of paths to be included in indexing Only nodes defined under this path will be indexed |
excludedPaths | String[] | List of paths to be excluded from indexing Nodes defined under this path will not be indexed |
queryPaths | String[] | List of paths for which this index is to be used index is used/picked for query with specific path predicate - those paths can be provided here. |
evaluatePathRestrictions | Boolean | false (Default) If set to true, index will evaluate path restrictions. Query with path predicate is respected while fetching results from index. Example: If we search for a text "we-retail" under the DAM path - /content/dam/we-retail Index definition without this property - will return all the results which has the text - "we-retail". Query Engine will filter out results that are not under /content/dam/we-retail Index definition with this property(value-true) - will return results under that path alone. |
codec | String | Name of Lucene Codec to use. full text lucene indexing uses OakCodec by default which disables compression -> index size grows because of this. To enable compression, we should set this property to -> Lucene46 Example: Full text Lucene available at /oak:index/lucene OOB |
refresh | Boolean | true Refreshes stored index definition On next async job execution cycle, index definition would be refreshed and this property will be removed upon refresh |
functionName | String | Name to be used to enable index usage with native query support For native queries(rep:native), we have a means to mention the index type. (Possible values supported are lucene or solr) In case of using Lucene, if multiple Lucene indexes are available and if we want to use specific one for our query, then we can create this functionName property with some meaningful name as value(kind of identifier for this index) This name will then be used in native queries. Example: //*[rep:native('functionNameValue', 'native search query expression'] Index definition with this functionName will be picked for query execution. |
useIfExists | String | Useful in blue-green deployments, when using Composite Node Store (Since Oak version 1.10.0) In AEM, it is 6.5 version which has Oak version to be 1.10.2 |
Properties/Node that gets created automatically | ||
reindex | Boolean | false |
reindexCount | Long | 1, very first time when the lucene index is created + first async job is run number gets incremented by 1 everytime reindex is triggered. |
(+) indexRules (Node) | nt:unstructured | This node with properties + few child nodes will be automatically created when we create lucene index with mandatory properties. Significance: Used to define node types and its properties to be indexed as part of this index definition. It can have any number of nodes defining the node types and each in turn can have any number of nodes defining the respective node's properties. Example: OOB cqPageLucene has indexRules defined for node type - "cq:Page" and properties of cq:Page => jcr:title, cq:lastModified etc (each of these properties is a child node under the node cq:Page) /oak:index/cqPageLucene/indexRules/cq:Page |
Other additional child nodes as part of lucene index | ||
(+) aggregates (Node) | nt:unstructured | It is defined based on primary node type and relative path patterns It can have any number of node types and each in turn can have include(n) rules (for defining relative paths) Significance: To include the contents of descendant nodes into a single node to easier search on content that is scattered across multiple nodes. If we would like to index jcr:content(cq:PageContent) of cq:Page up to certain depth, we can make use of aggregates node. Example: cqPageLucene has aggregates defined for node type - "cq:PageContent" and include0 to include3(4 nodes) for defining paths up to the desired depth. where each of the include rule will represent one hierarchy down with respect to the cq:PageContent /oak:index/cqPageLucene/aggregates/cq:PageContent |
(+) analyzers (Node) | nt:unstructured | Option to specify Analyzer class directly or via composition (defining Tokenzier + Filter) Significance: Analyzers is used to analyze text while indexing and while searching via query execution. It converts the given text into smaller units called Tokens (with help of Tokenizers + Filters) for the ease of searching There are many in-built Analyzers which extracts keywords from text, converts to lower case, removes stop words/common words etc. Most commonly used OOB Analyzer - StandardAnalyzer(org.apache.lucene.analysis.standard.StandardAnalyzer) which will filter stop words, punctuation and converting to lower case. It can also recognize URLs Usage: For Full text search scenario - features like synonyms, stemming support. Will try to create custom use case illustrating this in upcoming posts. |
(+) tika (Node) | nt:unstructured | Oak uses Apache Tika to extract text from binary content. Usage: Again in full text scenario, for displaying related binary results as part of search. Example : Search for a text - "we-retail" to display related images/pdf or any other related binary content. Will try to create custom use case illustrating this in upcoming posts. |
Table above is an high level information of Lucene Index Definition - High level purpose of indexRules, aggregates, analyzers and tika
Each of these in turn has further configurations (child nodes and respective properties) and has more details to it, will add in upcoming posts for better clarity.
Next step, we will create custom Lucene Property Index with mandatory properties.
Use case : Get all assets which has "cq:parentPath" property.
(The same use case that we used for creating Property Index in previous blog post)
- path=/content/dam/we-retail
- type=dam:AssetContent
- 1_property=cq:parentPath
- 1_property.operation=exists
- p.limit=-1
Video demo:
Thank you so much for this useful article, you saved my day!!
ReplyDeleteThe Best 10 Casinos with Jackpot City Games in 2021 - Mapy
ReplyDeleteJackpot City Games is a Microgaming-powered 수원 출장안마 online 광주 출장마사지 casino offering a fantastic selection 계룡 출장마사지 of slot games. 포천 출장안마 The company has been developing casino 거제 출장샵 games for years,
How can I add a property (checkbox) constraint to exclude from search results
ReplyDelete