Lucene Index in AEM - Part 3


Analyzers in Lucene Index

This post illustrates the use of analyzers in full text search with a sample use case.

Apache Lucene Analyzers:
As the name suggests, an analyzer analyzes text, both at indexing time and at search time (during query execution).
  • An analyzer examines the text of fields and generates a token stream. It can be either
    • a single Java class, or
    • a composition of a series of Tokenizer and Filter Java classes.
  • A Tokenizer breaks the data into lexical units, or tokens.
  • Filters then examine these tokens and amend them, discard them, or create new ones, based on the configuration.
  • Series of Tokenizer + Filters => Analyzer
  • Analyzer classes, Tokenizers, and Filters are available out of the box (OOB). Depending on the requirement, we can use either an Analyzer class directly or a Tokenizer + Filter combination (Analyzer via composition).
  • Examples:
  • Analyzer
    • StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer)
    • Tokenizes with the Standard tokenizer, converts tokens to lowercase, and removes stop words - most commonly used
  • Tokenizer
    • Standard (org.apache.lucene.analysis.standard.StandardTokenizerFactory)
    • Splits the text field into tokens, treating whitespace and punctuation as delimiters.
  • Filters:
    • Stopwords Filter (org.apache.lucene.analysis.core.StopFilterFactory) - Removes stop words
    • Lowercase Filter (org.apache.lucene.analysis.core.LowerCaseFilterFactory) - Converts tokens to lowercase
    • PorterStem Filter (org.apache.lucene.analysis.en.PorterStemFilterFactory) - Reduces tokens to their stems
  • In the Lucene Full Text Index definition:
    • If we opt for a direct Analyzer class, its fully qualified Java class name must be given (using a property called "class" - highlighted in the demo video).
    • If we use a Tokenizer and Filter combination, the name without the TokenizerFactory/FilterFactory suffix can be used (Standard for StandardTokenizerFactory and PorterStem for PorterStemFilterFactory - this is again highlighted in the demo video).
    • Note: If we use an Analyzer via composition, the class property needs to be removed.
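As a rough illustration of the composition described above (Tokenizer + Filters => Analyzer), here is a minimal plain-Java sketch. It does not use Lucene or Oak: ToyAnalyzer, its methods, and the stop-word list are hypothetical names made up for this example, and the real tokenizer and filter classes are considerably more sophisticated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model only: none of these types are Lucene or Oak classes.
final class ToyAnalyzer {
    // Illustrative stop-word list (an assumption, not Lucene's default set).
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "in");

    // Ordered filter chain, applied after the tokenizer.
    private static final List<UnaryOperator<List<String>>> FILTERS =
            List.of(ToyAnalyzer::lowerCase, ToyAnalyzer::removeStopWords);

    // "Tokenizer": split on anything that is not a letter or a digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // "Filter" that amends tokens: converts each one to lowercase.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // "Filter" that discards tokens: drops stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    // "Analyzer": the tokenizer followed by every filter, in order.
    static List<String> analyze(String text) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : FILTERS) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Biking, and HIKING trails"));
    }
}
```

Here ToyAnalyzer.analyze("The Biking, and HIKING trails") yields [biking, hiking, trails]. The order of the filters matters in this sketch: lowercasing runs first so that "The" is recognized as the stop word "the".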
Use case:
We will look at a common need in full text search: synonym and stemming support.
To keep using the Lucene full text index created in the previous posts (Part 1 and Part 2), we will consider the two scenarios below.

In the we-retail DAM assets:
  • /content/dam/we-retail/en/activities (which in turn contains biking, climbing, hiking, etc. as its kinds)
    • We will add a stemming filter to the analyzers in the Lucene Full Text Index so that assets related to "activities" are returned when we search with a word that shares its stem, such as "activity".
  • /content/dam/we-retail/en/products/apparel (which in turn contains gloves, coats, pants, and so on under the apparel category)
    • We will add a Synonym filter to the analyzers in the Lucene Full Text Index so that assets related to "apparel" are returned when we search with its synonyms, such as "clothing" or "garments".
Video Demo:
Stemming:
  • Highlights the use of the PorterStem Filter with the Standard Tokenizer (Analyzer via composition)
  • Also shows the EnglishAnalyzer class, which includes PorterStemFilter (direct Analyzer class)
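To see why stemming lets "activity" find "activities", the sketch below runs the same analysis on the indexed text and on the query, then compares the resulting terms. It is plain Java with no Lucene dependency: ToyStemDemo is a hypothetical name, and its stem method is a crude suffix stripper standing in for a real stemmer, not the Porter algorithm.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: not the Porter algorithm and not a Lucene class.
final class ToyStemDemo {
    // Crude stand-in for a stemmer: "activities" -> "activity", "gloves" -> "glove".
    static String stem(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.length() > 4 && t.endsWith("ies")) {
            return t.substring(0, t.length() - 3) + "y";
        }
        if (t.length() > 3 && t.endsWith("s") && !t.endsWith("ss")) {
            return t.substring(0, t.length() - 1);
        }
        return t;
    }

    // Tokenize on non-alphanumeric characters, then stem every token.
    static Set<String> analyze(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(stem(part));
            }
        }
        return terms;
    }

    // The same analysis is applied at "index time" and at "query time";
    // the query matches when all of its analyzed terms were indexed.
    static boolean matches(String indexedText, String query) {
        return analyze(indexedText).containsAll(analyze(query));
    }

    public static void main(String[] args) {
        String indexed = "we-retail en activities biking";
        System.out.println(matches(indexed, "activity"));
        System.out.println(matches(indexed, "apparel"));
    }
}
```

Because both sides pass through the same stem step, the indexed word "activities" and the query word "activity" end up as the same term, which is the effect the stemming filter is meant to achieve in the index.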


Synonym:
  • Uses the Synonym Filter with the Standard Tokenizer (Analyzer via composition)
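The idea behind synonym matching can be sketched in a similar way: each token that belongs to a synonym group is expanded to every member of that group, so "clothing" and "apparel" produce overlapping terms. This is a plain-Java illustration under stated assumptions: ToySynonymDemo and the hard-coded group are hypothetical, whereas the actual Synonym filter reads its synonyms from configuration.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: not Lucene's synonym filter.
final class ToySynonymDemo {
    // Hypothetical synonym group matching the use case in this post.
    private static final List<Set<String>> GROUPS =
            List.of(Set.of("apparel", "clothing", "garments"));

    // Lowercase, tokenize, and expand each token to its whole synonym group.
    static Set<String> analyze(String text) {
        Set<String> terms = new TreeSet<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (part.isEmpty()) {
                continue;
            }
            terms.add(part);
            for (Set<String> group : GROUPS) {
                if (group.contains(part)) {
                    terms.addAll(group);
                }
            }
        }
        return terms;
    }

    // A query matches when it shares at least one analyzed term with the indexed text.
    static boolean matches(String indexedText, String query) {
        return !Collections.disjoint(analyze(indexedText), analyze(query));
    }

    public static void main(String[] args) {
        String indexed = "products apparel gloves";
        System.out.println(matches(indexed, "clothing"));
        System.out.println(matches(indexed, "footwear"));
    }
}
```

With this group in place, a search for "clothing" or "garments" matches text that only contains "apparel", while an unrelated word such as "footwear" does not.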

Comments

  1. Hi VijayaLakshmi, I am trying to add lemmatization to Lucene indexing via a custom Lucene token filter, but AEM doesn't pick up the token filter from my core bundle. Do you know how to register a custom Lucene filter with AEM so that it can be used in the composition?
