com.puppycrawl.tools.checkstyle.site.SearchIndexGenerator

public final class SearchIndexGenerator extends Object

Generates search-index.json from the Checkstyle XDoc source files.

This is a plain Java main() class - no Maven plugin API required. It is invoked by exec-maven-plugin during the process-classes phase so the index is ready when Maven Site copies static resources.

Output is written as a JSON file. The search widget fetches this file using the fetch API and parses it to populate the search index.

Key design decisions

No duplicates. Only plain .xml files are processed for check/filter/filefilter directories. The .xml.template and .xml.vm siblings are pre-render source files that would produce identical URLs and duplicate entries. A secondary URL-keyed dedup guard is also applied across the entire output list.
Identifiable example titles. Both -config and -code example paragraphs are indexed. Their titles use the pattern "<CheckName>: Example1 [config]" and "<CheckName>: Example1 [code]" so users can distinguish a configuration snippet from its matching Java code example in search results.
Full general-page indexing. Each meaningful <section> in general documentation pages (e.g. config_system_properties, writingchecks, cmdline) is indexed as its own entry with the full section text used for keyword extraction - not just the first sentence. This makes page-internal headings discoverable.
Disambiguated generic titles. Structural section names that are repeated across many pages (e.g. "Overview", "Debug", "Contributing") are prefixed with the page title, yielding e.g. "Eclipse IDE: Debug" instead of a bare "Debug" that collides with "IntelliJ IDE: Debug".
Junk pages excluded. Release notes, auto-generated style coverage reports and bare category aggregator stubs are skipped.

Usage (called by exec-maven-plugin in pom.xml):

   java SearchIndexGenerator <xdocsDir> <outputFilePath>
   java SearchIndexGenerator src/site/xdoc target/site/search-index.json

Field Summary

Fields

Modifier and Type

Field

Description

private static final String

ANCHOR_SEPARATOR

String literal for anchor separator.

private static final String

BODY

String literal for body element.

private static final String

CHECKS

String literal for checks directory.

private static final Map<String,String>

CHECKS_CATEGORY_DISPLAY_NAMES

Display names for the check category subdirectories under checks/, keyed by lowercase directory name.

private static final String

COMMA_STR

String literal for comma.

private static final Pattern

CONFIG_CATEGORY

Matches config_<category>.xml files that redirect to check category pages.

private static final String

CONTENT

String literal for Content.

private static final String

DESCRIPTION

String literal for description element.

private static final Pattern

DOC_EXTENSION

Matches .xml, .xml.vm and .xml.template.

private static final String

ELLIPSIS

String literal for ellipsis.

private List<SearchIndexEntry>

entries

Accumulated search index entries.

private static final String

EXAMPLE_LABEL_CODE

Suffix label appended to example titles for Java code examples.

private static final String

EXAMPLE_LABEL_CONFIG

Suffix label appended to example titles for configuration snippets.

private static final Pattern

EXAMPLE_PARAGRAPH_ID

Matches an example paragraph id attribute that has a suffix of either -config or -code, capturing the base label (e.g. "Example1") in group 1 and the type ("config" or "code") in group 2.

private static final String

EXAMPLE_TYPE

String literal for Example document type.

private static final String

EXAMPLES_SUBSECTION

String literal for the Examples subsection name.

private static final String

EXTERNAL_GENERAL_ENTITIES

String literal for external general entities feature.

private static final String

EXTERNAL_PARAMETER_ENTITIES

String literal for external parameter entities feature.

private static final String

FILEFILTERS_DIR

Constant for the filefilters directory.

private static final String

FILTERS_DIR

Constant for the filters directory.

private static final String

GENERAL

String literal for General category.

private static final Set<String>

GENERIC_SECTION_NAMES

Generic section/subsection names that are structurally repeated across many unrelated general pages (IDE setup guides, writing-* guides, etc).

private static final String

ID_ATTR

String literal for id attribute.

private static final String

INDEX_HTML

Constant for the index file name.

private static final String

INDEX_XML

String literal for index.xml.

private static final int

MAX_DESCRIPTION_LENGTH

Maximum description length; longer text is truncated with an ellipsis.

private static final int

MAX_KEYWORDS

Maximum number of keywords.

private static final int

MIN_WORD_LENGTH

Minimum length a word must have to be kept as a keyword.

private static final String

NAME_ATTR

String literal for name attribute.

private static final Pattern

NON_ALPHANUMERIC

Non-alphanumeric pattern.

private static final String

PARSE_FAILURE_MSG

Exception message prefix used when an XDoc file fails to parse.

private static final String

PATH_SEPARATOR

String literal for path separator in URLs.

private static final Pattern

PLAIN_XML

Matches only plain .xml files (not .xml.vm or .xml.template).

private static final String

PROPERTIES_FRAGMENT

String literal for the Properties subsection name fragment.

private static final String

PROPERTY_TYPE

String literal for Property document type.

private static final String

SECTION

String literal for section element.

private Set<String>

seenUrls

Deduplication guard for URLs.

private static final String

SPACE

String literal for space.

private static final char

SPACE_CHAR

Character literal for space.

private static final Set<String>

STOP_WORDS

Stop words: too generic to be useful as search keywords.

private static final String

SUBSECTION

String literal for subsection element.

private static final String

TITLE

String literal for title element.

private static final String

TITLE_SEPARATOR

String literal for colon separator used in disambiguated titles.

private static final Pattern

WHITESPACE

Whitespace pattern.

Constructor Summary

Constructors

Modifier

Constructor

Description

private

SearchIndexGenerator()

Prevents instantiation from outside this class.

Method Summary

Modifier and Type

Method

Description

private void

addIfNew(SearchIndexEntry entry)

Adds an entry to the output list only if its URL has not been seen before.

private static SearchIndexEntry

buildExampleEntry(Element paragraph, String checkName, String baseUrl, String category)

Builds a single example entry from a paragraph element.

private static List<SearchIndexEntry>

buildGeneralPageEntries(File xmlFile)

Builds one search entry per top-level <section> in a general documentation page, using each section's full text for keyword extraction so that page-internal content is fully discoverable.

private static SearchIndexEntry

buildMainEntry(Document doc, File xmlFile, String category, String type, String baseUrl)

Builds the main search entry representing an entire check/filter document.

private static String

buildUrl(File xmlFile, File xdocsDir)

Builds the root-relative URL for an XDoc file, without any anchor.

private static String

capitalise(String input)

Capitalises the first character of a string.

private static String

derivePageTitle(Document doc, File xmlFile)

Derives a fallback page title from the document's <title> element or, failing that, from the filename.

private static String

disambiguateTitle(String sectionName, String pageTitle)

Disambiguates a section title when it is a generic, structurally repeated header (see GENERIC_SECTION_NAMES).

private static String

doxiaAnchorFor(String sectionName)

Converts a Doxia <section name="..."> value into the anchor id Doxia generates for it in the rendered HTML by replacing runs of whitespace with single underscores.

private void

execute(String... args)

Internal execution method to avoid static context for the logger.

private static String

extractAggregateDescription(NodeList sections)

Aggregates description from sections, taking the first non-empty Description subsection found across all sections in the document.

private static String

extractAggregateKeywords(String title, NodeList sections)

Aggregates keywords from sections using all section text so that the main check entry is discoverable by any term in the document.

private static String

extractDescription(Element section)

Extracts the first sentence of the Description subsection.

private static List<SearchIndexEntry>

extractExampleEntries(Document doc, String baseUrl, String category)

Extracts per-example search entries from a check/filter document.

private static String

extractFirstSentenceOrTruncated(String text)

Returns the first sentence of the given text (up to and including the first period), or the text truncated to MAX_DESCRIPTION_LENGTH with an ellipsis if no period is found within range.

private static String

extractKeywordsFromText(String text)

Extracts keywords from free-form text by splitting on non-word characters and filtering short and stop words.

private static void

extractPropertiesFromRows(Element propertiesSubsection, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)

Extracts property entries from table rows and adds them to the list.

private static List<SearchIndexEntry>

extractPropertyEntries(Document doc, String baseUrl, String category)

Extracts per-property search entries from a check/filter document.

private static String

extractTitle(Document doc, File xmlFile, NodeList sections)

Extracts the document title from the <title> element, falling back to the first non-empty, non-"Content" section name, and finally to a capitalised version of the file name.

private static Element

findSubsectionByPrefix(Element section, String fragment)

Finds a subsection within a section whose lowercased name contains the given fragment (e.g. "examples" or "propert" to match "Properties").

static void

main(String... args)

Main entry point called by exec-maven-plugin.

private static Document

parseXml(File xmlFile)

Parses the XML file into a Document with external entity resolution disabled for security.

private void

processChecksDirectory(File checksDir, File xdocsDir)

Walks the checks/ directory under the xdocs root and processes each category subdirectory.

private void

processDirectory(File dir, File xdocsDir, String category, String type)

Processes all plain .xml files in a directory (non-recursive).

private void

processGeneralPage(File xmlFile)

Parses a single general-documentation XDoc page and adds its per-section entries to the index.

private void

processGeneralPages(File xdocsDir)

Adds entries for the top-level general documentation pages.

private static void

processPropertyRow(NodeList cells, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)

Processes a single property row and adds an entry if valid.

private void

processXmlFile(File xmlFile, File xdocsDir, String category, String type)

Parses a single check/filter XDoc file and adds its main, example, and property entries to the index.

private static Element

requireBody(Document doc, String identifier)

Returns the document's <body> element, failing fast if it is absent.

private static String

resolvePageUrl(File xmlFile, File xdocsDir)

Resolves the correct URL for a general page file.

private static String

truncate(String text, int maxLength)

Truncates text to the given max length, appending an ellipsis if truncation occurred.

private static void

writeJson(List<SearchIndexEntry> indexEntries, Path outputFilePath)

Writes all index entries to the output file.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- CHECKS
  
  private static final String CHECKS
  
  String literal for checks directory.
  See Also:
  
  Constant Field Values
- COMMA_STR
  
  private static final String COMMA_STR
  
  String literal for comma.
  See Also:
  
  Constant Field Values
- SPACE
  
  private static final String SPACE
  
  String literal for space.
  See Also:
  
  Constant Field Values
- SPACE_CHAR
  
  private static final char SPACE_CHAR
  
  Character literal for space.
  See Also:
  
  Constant Field Values
- TITLE_SEPARATOR
  
  private static final String TITLE_SEPARATOR
  
  String literal for colon separator used in disambiguated titles.
  See Also:
  
  Constant Field Values
- ELLIPSIS
  
  private static final String ELLIPSIS
  
  String literal for ellipsis.
  See Also:
  
  Constant Field Values
- EXTERNAL_GENERAL_ENTITIES
  
  private static final String EXTERNAL_GENERAL_ENTITIES
  
  String literal for external general entities feature.
  See Also:
  
  Constant Field Values
- EXTERNAL_PARAMETER_ENTITIES
  
  private static final String EXTERNAL_PARAMETER_ENTITIES
  
  String literal for external parameter entities feature.
  See Also:
  
  Constant Field Values
- GENERAL
  
  private static final String GENERAL
  
  String literal for General category.
  See Also:
  
  Constant Field Values
- EXAMPLE_TYPE
  
  private static final String EXAMPLE_TYPE
  
  String literal for Example document type.
  See Also:
  
  Constant Field Values
- PROPERTY_TYPE
  
  private static final String PROPERTY_TYPE
  
  String literal for Property document type.
  See Also:
  
  Constant Field Values
- SUBSECTION
  
  private static final String SUBSECTION
  
  String literal for subsection element.
  See Also:
  
  Constant Field Values
- NAME_ATTR
  
  private static final String NAME_ATTR
  
  String literal for name attribute.
  See Also:
  
  Constant Field Values
- ID_ATTR
  
  private static final String ID_ATTR
  
  String literal for id attribute.
  See Also:
  
  Constant Field Values
- INDEX_XML
  
  private static final String INDEX_XML
  
  String literal for index.xml.
  See Also:
  
  Constant Field Values
- FILTERS_DIR
  
  private static final String FILTERS_DIR
  
  Constant for the filters directory.
  See Also:
  
  Constant Field Values
- FILEFILTERS_DIR
  
  private static final String FILEFILTERS_DIR
  
  Constant for the filefilters directory.
  See Also:
  
  Constant Field Values
- INDEX_HTML
  
  private static final String INDEX_HTML
  
  Constant for the index file name.
  See Also:
  
  Constant Field Values
- CONTENT
  
  private static final String CONTENT
  
  String literal for Content.
  See Also:
  
  Constant Field Values
- EXAMPLES_SUBSECTION
  
  private static final String EXAMPLES_SUBSECTION
  
  String literal for the Examples subsection name.
  See Also:
  
  Constant Field Values
- BODY
  
  private static final String BODY
  
  String literal for body element.
  See Also:
  
  Constant Field Values
- SECTION
  
  private static final String SECTION
  
  String literal for section element.
  See Also:
  
  Constant Field Values
- TITLE
  
  private static final String TITLE
  
  String literal for title element.
  See Also:
  
  Constant Field Values
- DESCRIPTION
  
  private static final String DESCRIPTION
  
  String literal for description element.
  See Also:
  
  Constant Field Values
- ANCHOR_SEPARATOR
  
  private static final String ANCHOR_SEPARATOR
  
  String literal for anchor separator.
  See Also:
  
  Constant Field Values
- PATH_SEPARATOR
  
  private static final String PATH_SEPARATOR
  
  String literal for path separator in URLs.
  See Also:
  
  Constant Field Values
- PROPERTIES_FRAGMENT
  
  private static final String PROPERTIES_FRAGMENT
  
  String literal for the Properties subsection name fragment.
  See Also:
  
  Constant Field Values
- PARSE_FAILURE_MSG
  
  private static final String PARSE_FAILURE_MSG
  
  Exception message prefix used when an XDoc file fails to parse.
  See Also:
  
  Constant Field Values
- EXAMPLE_LABEL_CONFIG
  
  private static final String EXAMPLE_LABEL_CONFIG
  
  Suffix label appended to example titles for configuration snippets. Yields e.g. "AnnotationLocation: Example1 [config]".
  See Also:
  
  Constant Field Values
- EXAMPLE_LABEL_CODE
  
  private static final String EXAMPLE_LABEL_CODE
  
  Suffix label appended to example titles for Java code examples. Yields e.g. "AnnotationLocation: Example1 [code]".
  See Also:
  
  Constant Field Values
- MIN_WORD_LENGTH
  
  private static final int MIN_WORD_LENGTH
  
  Minimum length a word must have to be kept as a keyword.
  See Also:
  
  Constant Field Values
- MAX_KEYWORDS
  
  private static final int MAX_KEYWORDS
  
  Maximum number of keywords.
  See Also:
  
  Constant Field Values
- MAX_DESCRIPTION_LENGTH
  
  private static final int MAX_DESCRIPTION_LENGTH
  
  Maximum description length; longer text is truncated with an ellipsis.
  See Also:
  
  Constant Field Values
- WHITESPACE
  
  private static final Pattern WHITESPACE
  
  Whitespace pattern.
- NON_ALPHANUMERIC
  
  private static final Pattern NON_ALPHANUMERIC
  
  Non-alphanumeric pattern.
- PLAIN_XML
  
  private static final Pattern PLAIN_XML
  
  Matches only plain .xml files (not .xml.vm or .xml.template). Used when scanning check/filter/filefilter directories to avoid processing pre-render source templates and producing duplicate index entries.
- DOC_EXTENSION
  
  private static final Pattern DOC_EXTENSION
  
  Matches .xml, .xml.vm and .xml.template. Used only for URL building (stripping the extension to produce a .html path) and for the general-pages scanner where we want to exclude templates by name rather than by extension.
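
The difference between PLAIN_XML and DOC_EXTENSION can be sketched as follows. The regular expressions below are assumptions chosen to fit the descriptions above, not the values used in the class, and the helper names isPlainXml and toHtmlName are hypothetical.

```java
import java.util.regex.Pattern;

final class XdocExtensionSketch {
    // Assumed regex: the whole name must end in exactly ".xml".
    static final Pattern PLAIN_XML = Pattern.compile(".*\\.xml$");

    // Assumed regex: ".xml", ".xml.vm" or ".xml.template" at the end of the name.
    static final Pattern DOC_EXTENSION =
        Pattern.compile("\\.xml(\\.vm|\\.template)?$");

    private XdocExtensionSketch() {
    }

    // True only for plain .xml files; template siblings are rejected.
    static boolean isPlainXml(String fileName) {
        return PLAIN_XML.matcher(fileName).matches();
    }

    // Strips any recognised documentation extension and appends ".html".
    static String toHtmlName(String fileName) {
        return DOC_EXTENSION.matcher(fileName).replaceFirst(".html");
    }
}
```

With these assumed patterns, a check page and its .xml.template sibling map to the same .html name, which is why only the plain .xml file should be indexed.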
- CONFIG_CATEGORY
  
  private static final Pattern CONFIG_CATEGORY
  
  Matches config_<category>.xml files that redirect to check category pages. Captures the category name (e.g. "metrics" from "config_metrics.xml") in group 1.
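
A minimal sketch of how such a pattern could map a redirecting file name to its category index page. The regular expression is an assumption (lowercase letters only for the category), and redirectTarget is a hypothetical helper; the target path checks/<category>/index.html follows the resolvePageUrl description.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConfigCategorySketch {
    // Assumed regex: group 1 captures the category, e.g. "metrics".
    static final Pattern CONFIG_CATEGORY =
        Pattern.compile("^config_([a-z]+)\\.xml$");

    private ConfigCategorySketch() {
    }

    // Returns the category index URL for a redirecting file, or empty otherwise.
    static Optional<String> redirectTarget(String fileName) {
        Matcher matcher = CONFIG_CATEGORY.matcher(fileName);
        if (matcher.matches()) {
            return Optional.of("checks/" + matcher.group(1) + "/index.html");
        }
        return Optional.empty();
    }
}
```

In this sketch a name such as config_system_properties.xml does not match because of the extra underscore; whether the real pattern behaves the same way is not stated in this documentation.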
- EXAMPLE_PARAGRAPH_ID
  
  private static final Pattern EXAMPLE_PARAGRAPH_ID
  Matches an example paragraph id attribute that has a suffix of either -config or -code, capturing the base label (e.g. "Example1") in group 1 and the type ("config" or "code") in group 2.
  Example ids found in XDoc source:
  
  id="Example1-config" -> label "Example1", type "config"
  
  id="Example1-code" -> label "Example1", type "code"
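
The mapping from a paragraph id to an example title can be sketched like this. The regular expression is an assumption that fits the two groups described above, and titleFor is a hypothetical helper; the title format follows the documented pattern "<CheckName>: Example1 [config]".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExampleIdSketch {
    // Assumed regex: group 1 = base label, group 2 = "config" or "code".
    static final Pattern EXAMPLE_PARAGRAPH_ID =
        Pattern.compile("^(.+)-(config|code)$");

    private ExampleIdSketch() {
    }

    // Builds the search title for an example paragraph, or returns null
    // when the id does not carry a -config or -code suffix.
    static String titleFor(String checkName, String paragraphId) {
        Matcher matcher = EXAMPLE_PARAGRAPH_ID.matcher(paragraphId);
        if (!matcher.matches()) {
            return null;
        }
        return checkName + ": " + matcher.group(1) + " [" + matcher.group(2) + "]";
    }
}
```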
- GENERIC_SECTION_NAMES
  
  private static final Set<String> GENERIC_SECTION_NAMES
  
  Generic section/subsection names that are structurally repeated across many unrelated general pages (IDE setup guides, writing-* guides, etc). On their own they are meaningless in search results ("Debug" appears identically in eclipse.xml, idea.xml, and netbeans.xml) so when one of these is used as a section title it is always disambiguated with the source page's own title, e.g. "Eclipse IDE: Debug".
- CHECKS_CATEGORY_DISPLAY_NAMES
  
  private static final Map<String,String> CHECKS_CATEGORY_DISPLAY_NAMES
  
  Display names for the check category subdirectories under checks/, keyed by lowercase directory name. Every directory that exists under checks/ must have an entry here - processChecksDirectory(java.io.File, java.io.File) fails fast if one is missing, so a contributor adding a new category is forced to register its display name instead of getting a guessed-at label.
- STOP_WORDS
  
  private static final Set<String> STOP_WORDS
  
  Stop words: too generic to be useful as search keywords.
- entries
  
  private List<SearchIndexEntry> entries
  
  Accumulated search index entries.
- seenUrls
  
  private Set<String> seenUrls
  
  Deduplication guard for URLs.

Constructor Details
- SearchIndexGenerator
  
  private SearchIndexGenerator()
  
  Prevents instantiation from outside this class.

Method Details
- main
  
  public static void main(String... args) throws IOException
  
  Main entry point called by exec-maven-plugin.
  
  Parameters:
  
  args - args[0] = path to the xdocs source directory (e.g. src/site/xdoc), args[1] = output file path (e.g. target/site/search-index.json)
  
  Throws:
  
  IOException - on file write failure
  
  IllegalArgumentException - if args are missing
  
  IllegalStateException - if xdocsDir is missing
- execute
  
  private void execute(String... args) throws IOException
  
  Internal execution method to avoid static context for the logger.
  
  Parameters:
  
  args - args[0] = path to the xdocs source directory (e.g. src/site/xdoc), args[1] = output file path
  
  Throws:
  
  IOException - on file write failure
  
  IllegalArgumentException - if args are missing
  
  IllegalStateException - if xdocsDir is missing
- processChecksDirectory
  
  private void processChecksDirectory(File checksDir, File xdocsDir)
  
  Walks the checks/ directory under the xdocs root and processes each category subdirectory.
  Every directory found here must have a corresponding entry in CHECKS_CATEGORY_DISPLAY_NAMES; an unmapped directory likely means a new check category was added without registering its display name, so this fails fast rather than guessing a label from the directory name.
  
  Parameters:
  
  checksDir - the checks root directory
  
  xdocsDir - the xdocs root (used for URL building)
  
  Throws:
  
  IllegalStateException - if checksDir cannot be listed, or if one of its subdirectories has no entry in CHECKS_CATEGORY_DISPLAY_NAMES
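
The fail-fast lookup described above might look roughly like the following. The map contents are illustrative placeholders, not the real category list, and displayNameFor is a hypothetical helper name.

```java
import java.util.Locale;
import java.util.Map;

final class CategoryLookupSketch {
    // Illustrative entries only; the real map covers every directory under checks/.
    static final Map<String, String> DISPLAY_NAMES = Map.of(
        "coding", "Coding",
        "metrics", "Metrics");

    private CategoryLookupSketch() {
    }

    // Looks up the display name, failing fast for an unregistered directory
    // instead of guessing a label from the directory name.
    static String displayNameFor(String directoryName) {
        String key = directoryName.toLowerCase(Locale.ROOT);
        String displayName = DISPLAY_NAMES.get(key);
        if (displayName == null) {
            throw new IllegalStateException(
                "No display name registered for check category: " + directoryName);
        }
        return displayName;
    }
}
```

Throwing here turns a forgotten registration into a build failure rather than a subtly wrong label in the search results.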
- processDirectory
  
  private void processDirectory(File dir, File xdocsDir, String category, String type)
  
  Processes all plain .xml files in a directory (non-recursive). index.xml files and any file whose name ends with .xml.template or .xml.vm are skipped.
  Skipping templates is critical: every check page has a sibling *.xml.template file that resolves to the same HTML URL. Without this filter both files would be processed, producing two identical (or near-identical) main entries plus doubled example and property entries for every check.
  
  For each plain .xml file, the main check/filter entry, per-example entries (both config and code), and per-property entries are added.
  
  Parameters:
  
  dir - directory to scan
  
  xdocsDir - xdocs root (used for URL building)
  
  category - category label for all entries in this directory
  
  type - document type ("Check", "Filter", "File Filter")
- processXmlFile
  
  private void processXmlFile(File xmlFile, File xdocsDir, String category, String type)
  
  Parses a single check/filter XDoc file and adds its main, example, and property entries to the index.
  A parse failure here means the source XDoc itself is malformed, which is a real problem with the documentation rather than something safe to skip - so this fails the build instead of logging a warning and silently continuing.
  
  Parameters:
  
  xmlFile - the XDoc source file to process
  
  xdocsDir - xdocs root (used for URL building)
  
  category - category label for entries from this file
  
  type - document type ("Check", "Filter", "File Filter")
  
  Throws:
  
  IllegalStateException - if xmlFile cannot be parsed
- processGeneralPages
  
  private void processGeneralPages(File xdocsDir)
  
  Adds entries for the top-level general documentation pages.
  Junk pages (release notes, auto-generated style coverage reports and bare category aggregator stubs) are skipped. Each remaining page is indexed per top-level <section>, using the section's full text content for keyword extraction so page-internal headings are fully discoverable. Generic structural section names (see GENERIC_SECTION_NAMES) are disambiguated by prefixing the page's own title.
  
  Parameters:
  
  xdocsDir - the xdocs root directory
- processGeneralPage
  
  private void processGeneralPage(File xmlFile)
  
  Parses a single general-documentation XDoc page and adds its per-section entries to the index.
  A parse failure here means the source XDoc itself is malformed, so this fails the build instead of logging a warning and continuing.
  
  Parameters:
  
  xmlFile - the XDoc source file to process
  
  Throws:
  
  IllegalStateException - if xmlFile cannot be parsed
- buildMainEntry
  
  private static SearchIndexEntry buildMainEntry(Document doc, File xmlFile, String category, String type, String baseUrl)
  
  Builds the main search entry representing an entire check/filter document.
  
  Parameters:
  
  doc - the parsed XDoc document
  
  xmlFile - the source file
  
  category - category label for this file's entry
  
  type - document type ("Check", "Filter", etc.)
  
  baseUrl - the page url without anchor
  
  Returns:
  
  an entry representing the document
- buildGeneralPageEntries
  
  private static List<SearchIndexEntry> buildGeneralPageEntries(File xmlFile) throws ParserConfigurationException, SAXException, IOException
  
  Builds one search entry per top-level <section> in a general documentation page, using each section's full text for keyword extraction so that page-internal content is fully discoverable.
  Generic structural section names (see GENERIC_SECTION_NAMES) are disambiguated as "<page title>: <section name>" to avoid collisions across pages (e.g. "Eclipse IDE: Debug" vs "IntelliJ IDE: Debug").
  
  Parameters:
  
  xmlFile - the XDoc source file to parse
  
  Returns:
  
  list of entries, one per top-level section found
  
  Throws:
  
  ParserConfigurationException - on XML parser setup failure
  
  SAXException - on XML parse error
  
  IOException - on file read failure
- extractExampleEntries
  
  private static List<SearchIndexEntry> extractExampleEntries(Document doc, String baseUrl, String category)
  Extracts per-example search entries from a check/filter document.
  Both -config and -code example paragraphs are indexed so users can find both the configuration snippet and the corresponding Java code example independently in search results.
  
  Titles use the pattern "<CheckName>: Example1 [config]" and "<CheckName>: Example1 [code]" to make the type immediately visible in search result listings without needing to open the page.
  
  Confirmed XDoc template structure for the Examples subsection:
  
     <p id="Example1-config">To configure the check...</p>
     <macro name="example"><param name="type" value="config"/></macro>
     <p id="Example1-code">Example:</p>
     <macro name="example"><param name="type" value="code"/></macro>

  Parameters:
  
  doc - the parsed XDoc document
  
  baseUrl - the page url without anchor
  
  category - category label
  
  Returns:
  
  list of per-example entries (both config and code); empty if none found
- buildExampleEntry
  
  private static SearchIndexEntry buildExampleEntry(Element paragraph, String checkName, String baseUrl, String category)
  
  Builds a single example entry from a paragraph element.
  
  Parameters:
  
  paragraph - the paragraph element containing the example
  
  checkName - the name of the check
  
  baseUrl - the base URL for the page
  
  category - the category label
  
  Returns:
  
  a SearchIndexEntry if the paragraph matches the example pattern, null otherwise
- extractPropertyEntries
  
  private static List<SearchIndexEntry> extractPropertyEntries(Document doc, String baseUrl, String category)
  
  Extracts per-property search entries from a check/filter document.
  Each row of the Properties table is indexed under the title "<CheckName>: <propertyName>" and linked to the property's own anchor on the page.
  
  Parameters:
  
  doc - the parsed XDoc document
  
  baseUrl - the page url without anchor
  
  category - category label
  
  Returns:
  
  list of per-property entries; empty if none found
- extractPropertiesFromRows
  
  private static void extractPropertiesFromRows(Element propertiesSubsection, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)
  
  Extracts property entries from table rows and adds them to the list.
  
  Parameters:
  
  propertiesSubsection - the properties subsection element
  
  checkName - the check name
  
  baseUrl - the page url without anchor
  
  category - category label
  
  propertyEntries - the list to add entries to
- processPropertyRow
  
  private static void processPropertyRow(NodeList cells, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)
  
  Processes a single property row and adds an entry if valid.
  
  Parameters:
  
  cells - the table cells
  
  checkName - the check name
  
  baseUrl - the page url without anchor
  
  category - category label
  
  propertyEntries - the list to add entries to
- addIfNew
  
  private void addIfNew(SearchIndexEntry entry)
  
  Adds an entry to the output list only if its URL has not been seen before. This is a secondary guard that catches any duplicates that slip through the primary filter (only processing plain .xml files), e.g. if a check has the same example paragraph id repeated across two sections.
  
  Parameters:
  
  entry - the entry to conditionally add
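
A URL-keyed guard of this kind can be sketched with a list plus a set. For brevity the sketch stores only the URL string; the real class stores SearchIndexEntry objects whose shape is not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    // Output list, in insertion order.
    private final List<String> entries = new ArrayList<>();
    // URLs already added; Set.add reports whether the value was new.
    private final Set<String> seenUrls = new HashSet<>();

    // Adds the URL only the first time it is seen; returns true if it was added.
    boolean addIfNew(String url) {
        boolean isNew = seenUrls.add(url);
        if (isNew) {
            entries.add(url);
        }
        return isNew;
    }

    List<String> entries() {
        return entries;
    }
}
```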
- findSubsectionByPrefix
  
  private static Element findSubsectionByPrefix(Element section, String fragment)
  
  Finds a subsection within a section whose lowercased name contains the given fragment (e.g. "examples" or "propert" to match "Properties").
  
  Parameters:
  
  section - the section to search
  
  fragment - lowercase fragment to match against the subsection name
  
  Returns:
  
  the matching subsection element, or null if not found
- parseXml
  
  private static Document parseXml(File xmlFile) throws ParserConfigurationException, SAXException, IOException
  
  Parses the XML file into a Document with external entity resolution disabled for security.
  
  Parameters:
  
  xmlFile - the XDoc source file
  
  Returns:
  
  the parsed Document
  
  Throws:
  
  ParserConfigurationException - on XML parser setup failure
  
  SAXException - on XML parse error
  
  IOException - on file read failure
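
One common way to disable external entity resolution with the JAXP DOM parser is shown below. This is a sketch under the assumption that the two feature constants in this class hold the standard SAX feature URIs; the real method may set further options.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class SecureParseSketch {
    // Standard SAX feature URIs (assumed values of the class constants).
    static final String EXTERNAL_GENERAL_ENTITIES =
        "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER_ENTITIES =
        "http://xml.org/sax/features/external-parameter-entities";

    private SecureParseSketch() {
    }

    // Parses the file with both kinds of external entities switched off.
    static Document parse(File xmlFile)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(EXTERNAL_GENERAL_ENTITIES, false);
        factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, false);
        return factory.newDocumentBuilder().parse(xmlFile);
    }
}
```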
- requireBody
  
  private static Element requireBody(Document doc, String identifier)
  
  Returns the document's <body> element, failing fast if it is absent. Every XDoc page processed by this generator is expected to have one; its absence indicates a malformed source file that should be fixed rather than silently skipped or producing an empty entry.
  
  Parameters:
  
  doc - the parsed document
  
  identifier - file path or URL used to identify the source in the error message
  
  Returns:
  
  the body element
  
  Throws:
  
  IllegalStateException - if doc has no <body> element
- extractTitle
  
  private static String extractTitle(Document doc, File xmlFile, NodeList sections)
  
  Extracts the document title from the <title> element, falling back to the first non-empty, non-"Content" section name, and finally to a capitalised version of the file name.
  
  Parameters:
  
  doc - the document
  
  xmlFile - the source file
  
  sections - the list of sections
  
  Returns:
  
  the title string, never empty
- extractAggregateDescription
  
  private static String extractAggregateDescription(NodeList sections)
  
  Aggregates description from sections, taking the first non-empty Description subsection found across all sections in the document.
  
  Parameters:
  
  sections - list of sections
  
  Returns:
  
  description string, possibly empty
- extractAggregateKeywords
  
  private static String extractAggregateKeywords(String title, NodeList sections)
  
  Aggregates keywords from sections using all section text so that the main check entry is discoverable by any term in the document.
  
  Parameters:
  
  title - the document title
  
  sections - list of sections
  
  Returns:
  
  keywords string
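
Keyword extraction of the kind described for extractKeywordsFromText (split on non-word characters, drop short words and stop words) can be sketched as follows. All constant values, the stop-word list and the space-joined output format are assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class KeywordSketch {
    // Assumed values; the real constants are not documented here.
    static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^a-z0-9]+");
    static final Set<String> STOP_WORDS = Set.of("the", "and", "for", "that", "with");
    static final int MIN_WORD_LENGTH = 3;
    static final int MAX_KEYWORDS = 5;

    private KeywordSketch() {
    }

    // Lowercases the text, splits it into words, filters them and keeps
    // at most MAX_KEYWORDS distinct words in first-seen order.
    static String extractKeywords(String text) {
        Set<String> keywords = new LinkedHashSet<>();
        String lowered = text.toLowerCase(Locale.ROOT);
        for (String word : NON_ALPHANUMERIC.split(lowered)) {
            if (keywords.size() >= MAX_KEYWORDS) {
                break;
            }
            if (word.length() >= MIN_WORD_LENGTH && !STOP_WORDS.contains(word)) {
                keywords.add(word);
            }
        }
        return String.join(" ", keywords);
    }
}
```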
- extractDescription
  
  private static String extractDescription(Element section)
  
  Extracts the first sentence of the Description subsection. Returns an empty string if no Description subsection is found.
  
  Parameters:
  
  section - the <section> element to search
  
  Returns:
  
  first sentence of the description, or empty string
- derivePageTitle
  
  private static String derivePageTitle(Document doc, File xmlFile)
  
  Derives a fallback page title from the document's <title> element or, failing that, from the filename.
  
  Parameters:
  
  doc - the parsed document
  
  xmlFile - the source file
  
  Returns:
  
  a non-empty title string
- disambiguateTitle
  
  private static String disambiguateTitle(String sectionName, String pageTitle)
  
  Disambiguates a section title when it is a generic, structurally repeated header (see GENERIC_SECTION_NAMES). Non-generic section names are returned unchanged.
  
  Parameters:
  
  sectionName - the raw section name
  
  pageTitle - the title of the page that contains the section
  
  Returns:
  
  either sectionName unchanged, or "<pageTitle>: <sectionName>" if generic
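
  A minimal sketch of the prefixing rule follows. The set contents are only the three example names quoted in the class description, not the real GENERIC_SECTION_NAMES, and the ": " separator is inferred from the documented "<pageTitle>: <sectionName>" pattern. The class and method names are hypothetical.

  ```java
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  final class DisambiguateSketch {

      // Illustrative subset only; the real GENERIC_SECTION_NAMES contents are not shown here.
      private static final Set<String> GENERIC =
              new HashSet<>(Arrays.asList("Overview", "Debug", "Contributing"));

      // Assumed separator, inferred from the "<pageTitle>: <sectionName>" pattern.
      private static final String SEPARATOR = ": ";

      static String disambiguate(String sectionName, String pageTitle) {
          // Generic names get the page title as a prefix; everything else passes through.
          return GENERIC.contains(sectionName)
                  ? pageTitle + SEPARATOR + sectionName
                  : sectionName;
      }

      public static void main(String[] args) {
          System.out.println(disambiguate("Debug", "Eclipse IDE"));        // Eclipse IDE: Debug
          System.out.println(disambiguate("Suppressions", "Eclipse IDE")); // Suppressions
      }
  }
  ```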
- doxiaAnchorFor
  
  private static String doxiaAnchorFor(String sectionName)
  
  Converts a Doxia <section name="..."> value into the anchor id Doxia generates for it in the rendered HTML by replacing runs of whitespace with single underscores.
  
  Parameters:
  
  sectionName - the raw name attribute value
  
  Returns:
  
  the anchor id Doxia would render for this section name
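
  The whitespace rule stated above amounts to a single regular-expression replacement. The sketch below uses a hypothetical class AnchorSketch and does not trim or otherwise normalise the input, since the description mentions no such step.

  ```java
  final class AnchorSketch {

      static String anchorFor(String sectionName) {
          // Each run of one or more whitespace characters becomes a single underscore.
          return sectionName.replaceAll("\\s+", "_");
      }

      public static void main(String[] args) {
          System.out.println(anchorFor("Command line usage")); // Command_line_usage
      }
  }
  ```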
- extractFirstSentenceOrTruncated
  
  private static String extractFirstSentenceOrTruncated(String text)
  
  Returns the first sentence of the given text (up to and including the first period), or the text truncated to MAX_DESCRIPTION_LENGTH with an ellipsis if no period is found within range.
  
  Parameters:
  
  text - the source text, already whitespace-normalised
  
  Returns:
  
  first sentence or truncated text
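
  One possible reading of this rule is sketched below. The limit of 150 characters and the three-dot ellipsis are assumed values, not the real MAX_DESCRIPTION_LENGTH and ELLIPSIS constants, and "within range" is interpreted as "the sentence, period included, fits within the limit". The class name is hypothetical.

  ```java
  final class FirstSentenceSketch {

      private static final int MAX_LEN = 150;       // assumed value of MAX_DESCRIPTION_LENGTH
      private static final String ELLIPSIS = "..."; // assumed value of ELLIPSIS

      static String firstSentenceOrTruncated(String text) {
          int period = text.indexOf('.');
          // Keep the first sentence only if it, including its period, fits within MAX_LEN.
          if (period >= 0 && period < MAX_LEN) {
              return text.substring(0, period + 1);
          }
          return truncate(text, MAX_LEN);
      }

      static String truncate(String text, int maxLength) {
          // Short text is returned untouched; longer text is cut and marked with an ellipsis.
          if (text.length() <= maxLength) {
              return text;
          }
          return text.substring(0, maxLength) + ELLIPSIS;
      }

      public static void main(String[] args) {
          System.out.println(firstSentenceOrTruncated("Checks that a class is final. More text."));
      }
  }
  ```

  A plain search for the first period also stops at abbreviations such as "e.g.", so a description that uses one early may be cut short.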
- truncate
  
  private static String truncate(String text, int maxLength)
  
  Truncates text to the given max length, appending an ellipsis if truncation occurred.
  
  Parameters:
  
  text - the text to truncate
  
  maxLength - maximum length before truncation
  
  Returns:
  
  original text if short enough, otherwise truncated with ellipsis
- buildUrl
  
  private static String buildUrl(File xmlFile, File xdocsDir)
  
  Builds the root-relative URL for an XDoc file, without any anchor. Always uses forward slashes regardless of OS.
  
  Parameters:
  
  xmlFile - the source XDoc file
  
  xdocsDir - the xdocs root directory
  
  Returns:
  
  root-relative URL string with no anchor
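
  A sketch of how such a URL could be derived with java.nio.file is shown below. It assumes that the rendered page keeps the source file's base name with an .html extension and that the URL carries no leading slash; neither detail is confirmed by the description above. The class name is hypothetical.

  ```java
  import java.io.File;
  import java.nio.file.Path;

  final class BuildUrlSketch {

      static String buildUrl(File xmlFile, File xdocsDir) {
          // Path of the file relative to the xdocs root, e.g. checks/design/finalclass.xml.
          Path relative = xdocsDir.toPath().relativize(xmlFile.toPath());
          // Normalise separators so a Windows build emits the same URLs as a Unix build.
          String url = relative.toString().replace(File.separatorChar, '/');
          // Assumption: the rendered page is the same name with ".html" instead of ".xml".
          return url.endsWith(".xml")
                  ? url.substring(0, url.length() - ".xml".length()) + ".html"
                  : url;
      }

      public static void main(String[] args) {
          System.out.println(buildUrl(
                  new File("src/site/xdoc/checks/design/finalclass.xml"),
                  new File("src/site/xdoc")));
      }
  }
  ```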
- resolvePageUrl
  
  private static String resolvePageUrl(File xmlFile, File xdocsDir)
  
  Resolves the URL for a general page file. A config_<category>.xml file that redirects to a check category page resolves to checks/<category>/index.html rather than to a URL derived from its own path.
  
  Parameters:
  
  xmlFile - the source XDoc file
  
  xdocsDir - the xdocs root directory
  
  Returns:
  
  the resolved URL
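
  The category redirect could look roughly like the sketch below. How the real method decides that a config_ file is a category redirect is not shown in this section, so the sketch uses a hypothetical set of category names; the pattern, the class name and the two-string signature are likewise assumptions made to keep the example self-contained.

  ```java
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;
  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  final class ResolvePageUrlSketch {

      // Assumed file-name shape for category redirect pages.
      private static final Pattern CONFIG_CATEGORY = Pattern.compile("config_(\\w+)\\.xml");

      // Hypothetical subset of check categories, for illustration only.
      private static final Set<String> CATEGORIES =
              new HashSet<>(Arrays.asList("coding", "naming", "imports"));

      static String resolve(String fileName, String pathBasedUrl) {
          Matcher matcher = CONFIG_CATEGORY.matcher(fileName);
          // Only known categories are redirected; other config_ pages keep their own URL.
          if (matcher.matches() && CATEGORIES.contains(matcher.group(1))) {
              return "checks/" + matcher.group(1) + "/index.html";
          }
          return pathBasedUrl;
      }

      public static void main(String[] args) {
          System.out.println(resolve("config_coding.xml", "config_coding.html"));
          System.out.println(resolve("config_system_properties.xml",
                  "config_system_properties.html"));
      }
  }
  ```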
- extractKeywordsFromText
  
  private static String extractKeywordsFromText(String text)
  
  Extracts keywords from free-form text by splitting on non-word characters and discarding stop words and words shorter than MIN_WORD_LENGTH.
  
  Parameters:
  
  text - input text
  
  Returns:
  
  comma-separated keyword string (up to MAX_KEYWORDS words)
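
  The split-and-filter pipeline can be sketched as follows. The constant values and the stop-word list are placeholders rather than the real MIN_WORD_LENGTH, MAX_KEYWORDS and STOP_WORDS, and both lowercasing and de-duplication are assumptions that the description does not confirm. The class name is hypothetical.

  ```java
  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.LinkedHashSet;
  import java.util.Locale;
  import java.util.Set;

  final class KeywordSketch {

      private static final int MIN_WORD_LENGTH = 3; // assumed value
      private static final int MAX_KEYWORDS = 5;    // assumed value, kept small for the demo
      // Placeholder stop words; the real STOP_WORDS set is not shown here.
      private static final Set<String> STOP_WORDS =
              new HashSet<>(Arrays.asList("the", "and", "for", "that", "with"));

      static String extractKeywords(String text) {
          // LinkedHashSet keeps first-seen order and drops repeats (an assumption).
          Set<String> keywords = new LinkedHashSet<>();
          for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
              if (keywords.size() >= MAX_KEYWORDS) {
                  break;
              }
              if (word.length() >= MIN_WORD_LENGTH && !STOP_WORDS.contains(word)) {
                  keywords.add(word);
              }
          }
          return String.join(",", keywords);
      }

      public static void main(String[] args) {
          System.out.println(extractKeywords(
                  "Checks that the class is final, and the constructor is private."));
          // checks,class,final,constructor,private
      }
  }
  ```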
- writeJson
  
  private static void writeJson(List<SearchIndexEntry> indexEntries, Path outputFilePath) throws IOException
  
  Serialises all index entries to JSON and writes them to the output file.
  
  Parameters:
  
  indexEntries - the list of entries to serialise
  
  outputFilePath - the full path to the output file
  
  Throws:
  
  IOException - on file write failure
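
  A dependency-free way to write such a file is sketched below. The Entry class is a hypothetical stand-in for SearchIndexEntry with just two assumed fields, and the top-level JSON array layout is a guess; the real field set and output shape are not documented in this section.

  ```java
  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  final class WriteJsonSketch {

      /** Hypothetical stand-in for SearchIndexEntry. */
      static final class Entry {
          final String title;
          final String url;

          Entry(String title, String url) {
              this.title = title;
              this.url = url;
          }
      }

      /** Escapes quotes, backslashes and control characters for a JSON string value. */
      static String escape(String value) {
          StringBuilder sb = new StringBuilder();
          for (char c : value.toCharArray()) {
              switch (c) {
                  case '"': sb.append("\\\""); break;
                  case '\\': sb.append("\\\\"); break;
                  case '\n': sb.append("\\n"); break;
                  case '\r': sb.append("\\r"); break;
                  case '\t': sb.append("\\t"); break;
                  default:
                      if (c < 0x20) {
                          sb.append(String.format("\\u%04x", (int) c));
                      } else {
                          sb.append(c);
                      }
              }
          }
          return sb.toString();
      }

      static String toJson(List<Entry> entries) {
          StringBuilder sb = new StringBuilder("[");
          for (int i = 0; i < entries.size(); i++) {
              if (i > 0) {
                  sb.append(',');
              }
              Entry entry = entries.get(i);
              sb.append("{\"title\":\"").append(escape(entry.title))
                .append("\",\"url\":\"").append(escape(entry.url)).append("\"}");
          }
          return sb.append(']').toString();
      }

      static void writeJson(List<Entry> entries, Path outputFilePath) throws IOException {
          // Create missing parent directories so a fresh target/site tree does not fail the write.
          Path parent = outputFilePath.getParent();
          if (parent != null) {
              Files.createDirectories(parent);
          }
          Files.write(outputFilePath, toJson(entries).getBytes(StandardCharsets.UTF_8));
      }
  }
  ```

  Writing the bytes as UTF-8 keeps the file readable by the browser's fetch API regardless of the platform default charset on the build machine.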
- capitalise
  
  private static String capitalise(String input)
  
  Capitalises the first character of a string.
  
  Parameters:
  
  input - the string to capitalise
  
  Returns:
  
  string with first character uppercased, or input unchanged if empty
