Class SearchIndexGenerator
Generates search-index.json from the Checkstyle XDoc source files.
This is a plain Java main() class - no Maven plugin API required.
It is invoked by exec-maven-plugin during the process-classes
phase so the index is ready when Maven Site copies static resources.
Output is written as a JSON file. The search widget fetches this file using the fetch API and parses it to populate the search index.
Key design decisions
- No duplicates. Only plain .xml files are processed for check/filter/filefilter directories. The .xml.template and .xml.vm siblings are pre-render source files that would produce identical URLs and duplicate entries. A secondary URL-keyed dedup guard is also applied across the entire output list.
- Identifiable example titles. Both -config and -code example paragraphs are indexed. Their titles use the pattern "<CheckName>: Example1 [config]" and "<CheckName>: Example1 [code]" so users can distinguish a configuration snippet from its matching Java code example in search results.
- Full general-page indexing. Each meaningful <section> in general documentation pages (e.g. config_system_properties, writingchecks, cmdline) is indexed as its own entry with the full section text used for keyword extraction - not just the first sentence. This makes page-internal headings discoverable.
- Disambiguated generic titles. Structural section names that are repeated across many pages (e.g. "Overview", "Debug", "Contributing") are prefixed with the page title, yielding e.g. "Eclipse IDE: Debug" instead of a bare "Debug" that collides with "IntelliJ IDE: Debug".
- Junk pages excluded. Release notes, auto-generated style coverage reports and bare category aggregator stubs are skipped.
Usage (called by exec-maven-plugin in pom.xml):
java SearchIndexGenerator <xdocsDir> <outputFilePath>
java SearchIndexGenerator src/site/xdoc target/site/search-index.json
-
Field Summary
Fields
Modifier and Type | Field | Description
private static final String | ANCHOR_SEPARATOR | String literal for anchor separator.
private static final String | BODY | String literal for body element.
 | CATEGORY_MAP | Category mapping: XDoc subdirectory name to display label.
private static final String | CHECKS | String literal for checks directory.
private static final String | COMMA_STR | String literal for comma.
private static final Pattern | CONFIG_CATEGORY | Matches config_<category>.xml files that redirect to check category pages.
private static final String | CONTENT | String literal for Content.
private static final String | DESCRIPTION | String literal for description element.
private static final Pattern | DOC_EXTENSION | Matches .xml, .xml.vm and .xml.template.
private static final String | ELLIPSIS | String literal for ellipsis.
private List<SearchIndexEntry> | entries | Accumulated search index entries.
private static final String | EXAMPLE_LABEL_CODE | Suffix label appended to example titles for Java code examples.
private static final String | EXAMPLE_LABEL_CONFIG | Suffix label appended to example titles for configuration snippets.
private static final Pattern | EXAMPLE_PARAGRAPH_ID | Matches an example paragraph id attribute that has a suffix of either -config or -code, capturing the base label.
private static final String | EXAMPLE_TYPE | String literal for Example document type.
private static final String | EXAMPLES_SUBSECTION | String literal for the Examples subsection name.
private static final String | EXTERNAL_GENERAL_ENTITIES | String literal for external general entities feature.
private static final String | EXTERNAL_PARAMETER_ENTITIES | String literal for external parameter entities feature.
private static final String | GENERAL | String literal for General category.
 | GENERIC_SECTION_NAMES | Generic section/subsection names that are structurally repeated across many unrelated general pages (IDE setup guides, writing-* guides, etc).
private static final String | ID_ATTR | String literal for id attribute.
private static final String | INDEX_XML | String literal for index.xml.
private final Logger | logger | Logger for this class.
private static final int | MAX_DESCRIPTION_LENGTH | Magic number for maximum description length.
private static final int | MAX_KEYWORDS | Magic number for maximum keywords.
private static final int | MIN_WORD_LENGTH | Magic number for minimum word length.
private static final String | NAME_ATTR | String literal for name attribute.
private static final Pattern | NON_ALPHANUMERIC | Non-alphanumeric pattern.
private static final Pattern | PLAIN_XML | Matches only plain .xml files (not .xml.vm or .xml.template).
private static final String | PROPERTIES_FRAGMENT | String literal for the Properties subsection name fragment.
private static final String | PROPERTY_TYPE | String literal for Property document type.
private static final String | SECTION | String literal for section element.
 | seenUrls | Deduplication guard for URLs.
private static final String | SKIPPING_MSG | Log message for skipping files.
private static final String | SPACE | String literal for space.
private static final char | SPACE_CHAR | Character literal for space.
 | STOP_WORDS | Stop words: too generic to be useful as search keywords.
private static final String | SUBSECTION | String literal for subsection element.
private static final String | TITLE | String literal for title element.
private static final String | TITLE_SEPARATOR | String literal for colon separator used in disambiguated titles.
private static final Pattern | WHITESPACE | Whitespace pattern.
-
Constructor Summary
Constructors
private SearchIndexGenerator() | Prevent instantiation.
-
Method Summary
Modifier and Type | Method | Description
private void | addIfNew(SearchIndexEntry entry) | Adds an entry to the output list only if its URL has not been seen before.
private static SearchIndexEntry | buildExampleEntry(Element paragraph, String checkName, String baseUrl, String category) | Builds a single example entry from a paragraph element.
private static List<SearchIndexEntry> | buildGeneralPageEntries(File xmlFile) | Builds one search entry per top-level <section> in a general documentation page, using each section's full text for keyword extraction so that page-internal content is fully discoverable.
private static SearchIndexEntry | buildMainEntry(Document doc, File xmlFile, String category, String type, String baseUrl) | Builds the main search entry representing an entire check/filter document.
private static String | buildUrl | Builds the root-relative URL for an XDoc file, without any anchor.
private static String | capitalise(String input) | Capitalises the first character of a string.
private static String | derivePageTitle(Document doc, File xmlFile) | Derives a fallback page title from the document's <title> element or, failing that, from the filename.
private static String | disambiguateTitle(String sectionName, String pageTitle) | Disambiguates a section title when it is a generic, structurally repeated header (see GENERIC_SECTION_NAMES).
private static String | doxiaAnchorFor(String sectionName) | Converts a Doxia <section name="..."> value into the anchor id Doxia generates for it in the rendered HTML by replacing runs of whitespace with single underscores.
private void | execute | Internal execution method to avoid static context for the logger.
private static String | extractAggregateDescription(NodeList sections) | Aggregates description from sections, taking the first non-empty Description subsection found across all sections in the document.
private static String | extractAggregateKeywords(String title, NodeList sections) | Aggregates keywords from sections using all section text so that the main check entry is discoverable by any term in the document.
private static String | extractDescription(Element section) | Extracts the first sentence of the Description subsection.
private static List<SearchIndexEntry> | extractExampleEntries(Document doc, String baseUrl, String category) | Extracts per-example search entries from a check/filter document.
private static String | extractFirstSentenceOrTruncated | Returns the first sentence of the given text (up to and including the first period), or the text truncated to MAX_DESCRIPTION_LENGTH with an ellipsis if no period is found within range.
private static String | extractKeywordsFromText | Extracts keywords from free-form text by splitting on non-word characters and filtering short and stop words.
private static void | extractPropertiesFromRows(Element propertiesSubsection, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries) | Extracts property entries from table rows and adds them to the list.
private static List<SearchIndexEntry> | extractPropertyEntries(Document doc, String baseUrl, String category) | Extracts per-property search entries from a check/filter document.
private static String | extractTitle(Document doc, File xmlFile, NodeList sections) | Extracts the document title from the <title> element, falling back to the first non-empty, non-"Content" section name, and finally to a capitalised version of the file name.
private static Element | findSubsectionByPrefix(Element section, String fragment) | Finds a subsection within a section whose lowercased name contains the given fragment.
static void | main(String[] args) | Main entry point called by exec-maven-plugin.
private static Document | parseXml(File xmlFile) | Parses the XML file into a Document with external entity resolution disabled for security.
private void | processChecksDirectory(File checksDir, File xdocsDir) | Walks src/xdocs/checks/ and processes each category subdirectory.
private void | processDirectory(File dir, File xdocsDir, String category, String type) | Processes all plain .xml files in a directory (non-recursive).
private void | processGeneralPages(File xdocsDir) | Adds entries for the top-level general documentation pages.
private static void | processPropertyRow(NodeList cells, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries) | Processes a single property row and adds an entry if valid.
private static String | resolvePageUrl(File xmlFile, File xdocsDir) | Resolves the correct URL for a general page file.
private static String | truncate | Truncates text to the given max length, appending an ellipsis if truncation occurred.
private void | writeJson(List<SearchIndexEntry> indexEntries, Path outputFilePath) | Writes all index entries to the output file.
-
Field Details
-
CHECKS
String literal for checks directory.
-
COMMA_STR
String literal for comma.
-
SPACE
String literal for space.
-
SPACE_CHAR
Character literal for space.
-
TITLE_SEPARATOR
String literal for colon separator used in disambiguated titles.
-
ELLIPSIS
String literal for ellipsis.
-
EXTERNAL_GENERAL_ENTITIES
String literal for external general entities feature.
-
EXTERNAL_PARAMETER_ENTITIES
String literal for external parameter entities feature.
-
GENERAL
String literal for General category.
-
EXAMPLE_TYPE
String literal for Example document type.
-
PROPERTY_TYPE
String literal for Property document type.
-
SUBSECTION
String literal for subsection element.
-
NAME_ATTR
String literal for name attribute.
-
ID_ATTR
String literal for id attribute.
-
INDEX_XML
String literal for index.xml.
-
CONTENT
String literal for Content.
-
EXAMPLES_SUBSECTION
String literal for the Examples subsection name.
-
BODY
String literal for body element.
-
SECTION
String literal for section element.
-
TITLE
String literal for title element.
-
DESCRIPTION
String literal for description element.
-
ANCHOR_SEPARATOR
String literal for anchor separator.
-
PROPERTIES_FRAGMENT
String literal for the Properties subsection name fragment.
-
SKIPPING_MSG
Log message for skipping files.
-
EXAMPLE_LABEL_CONFIG
Suffix label appended to example titles for configuration snippets. Yields e.g. "AnnotationLocation: Example1 [config]".
-
EXAMPLE_LABEL_CODE
Suffix label appended to example titles for Java code examples. Yields e.g. "AnnotationLocation: Example1 [code]".
-
MIN_WORD_LENGTH
Magic number for minimum word length.
-
MAX_KEYWORDS
Magic number for maximum keywords.
-
MAX_DESCRIPTION_LENGTH
Magic number for maximum description length.
-
WHITESPACE
Whitespace pattern.
-
NON_ALPHANUMERIC
Non-alphanumeric pattern.
-
PLAIN_XML
Matches only plain .xml files (not .xml.vm or .xml.template). Used when scanning check/filter/filefilter directories to avoid processing pre-render source templates and producing duplicate index entries.
-
DOC_EXTENSION
Matches .xml, .xml.vm and .xml.template. Used only for URL building (stripping the extension to produce a .html path) and for the general-pages scanner where we want to exclude templates by name rather than by extension.
-
CONFIG_CATEGORY
Matches config_<category>.xml files that redirect to check category pages. Captures the category name (e.g. "metrics" from "config_metrics.xml") in group 1.
-
EXAMPLE_PARAGRAPH_ID
Matches an example paragraph id attribute that has a suffix of either -config or -code, capturing the base label (e.g. "Example1") in group 1 and the type ("config" or "code") in group 2.
Example ids found in XDoc source:
id="Example1-config" -> label "Example1", type "config"
id="Example1-code" -> label "Example1", type "code"
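The two-group capture described above can be sketched as follows. The regular expression and the class and method names are assumptions made for illustration; the actual EXAMPLE_PARAGRAPH_ID pattern is not shown on this page and may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExampleIdSketch {
    // Assumed pattern: group 1 is the base label, group 2 is the type.
    static final Pattern EXAMPLE_ID =
            Pattern.compile("^(Example\\d+)-(config|code)$");

    /** Returns "<checkName>: <label> [<type>]", or null if the id is not an example id. */
    static String titleFor(String checkName, String id) {
        Matcher matcher = EXAMPLE_ID.matcher(id);
        if (!matcher.matches()) {
            return null;
        }
        return checkName + ": " + matcher.group(1) + " [" + matcher.group(2) + "]";
    }

    public static void main(String[] args) {
        System.out.println(titleFor("AnnotationLocation", "Example1-config"));
        System.out.println(titleFor("AnnotationLocation", "Example1-code"));
    }
}
```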
-
GENERIC_SECTION_NAMES
Generic section/subsection names that are structurally repeated across many unrelated general pages (IDE setup guides, writing-* guides, etc). On their own they are meaningless in search results ("Debug" appears identically in eclipse.xml, idea.xml, and netbeans.xml) so when one of these is used as a section title it is always disambiguated with the source page's own title, e.g. "Eclipse IDE: Debug".
-
CATEGORY_MAP
Category mapping: XDoc subdirectory name to display label.
-
STOP_WORDS
Stop words: too generic to be useful as search keywords.
-
logger
Logger for this class.
-
entries
Accumulated search index entries.
-
seenUrls
Deduplication guard for URLs.
-
-
Constructor Details
-
SearchIndexGenerator
private SearchIndexGenerator()
Prevent instantiation.
-
-
Method Details
-
main
Main entry point called by exec-maven-plugin.
Parameters:
args - args[0] = path to src/xdocs, args[1] = output file path
Throws:
IOException - on file write failure
IllegalArgumentException - if args are missing
IllegalStateException - if xdocsDir is missing
-
execute
Internal execution method to avoid static context for the logger.
Parameters:
args - args[0] = path to src/xdocs, args[1] = output file path
Throws:
IOException - on file write failure
IllegalArgumentException - if args are missing
IllegalStateException - if xdocsDir is missing
-
processChecksDirectory
Walks src/xdocs/checks/ and processes each category subdirectory.
Parameters:
checksDir - the checks root directory
xdocsDir - the xdocs root (used for URL building)
-
processDirectory
Processes all plain .xml files in a directory (non-recursive). index.xml files and any file whose name ends with .xml.template or .xml.vm are skipped.
Skipping templates is critical: every check page has a sibling *.xml.template file that resolves to the same HTML URL. Without this filter both files would be processed, producing two identical (or near-identical) main entries plus doubled example and property entries for every check.
For each plain .xml file, the main check/filter entry, per-example entries (both config and code), and per-property entries are added.
Parameters:
dir - directory to scan
xdocsDir - xdocs root (used for URL building)
category - category label for all entries in this directory
type - document type ("Check", "Filter", "File Filter")
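To illustrate why a plain .xml file and its template sibling collide on the same URL, here is a minimal sketch of the two file-name patterns. Both regular expressions and the helper names are assumptions based on the PLAIN_XML and DOC_EXTENSION descriptions, not the class's actual definitions.

```java
import java.util.regex.Pattern;

final class XdocFileNameSketch {
    // Assumed equivalents of PLAIN_XML and DOC_EXTENSION.
    static final Pattern PLAIN_XML = Pattern.compile(".*\\.xml$");
    static final Pattern DOC_EXTENSION = Pattern.compile("\\.xml(\\.vm|\\.template)?$");

    /** True only for names ending in ".xml"; template siblings are rejected. */
    static boolean isPlainXml(String fileName) {
        return PLAIN_XML.matcher(fileName).matches();
    }

    /** Strips any of the three source extensions and appends ".html". */
    static String toHtmlName(String fileName) {
        return DOC_EXTENSION.matcher(fileName).replaceFirst(".html");
    }

    public static void main(String[] args) {
        for (String name : new String[] {"typename.xml", "typename.xml.template", "typename.xml.vm"}) {
            System.out.println(name + " plain=" + isPlainXml(name) + " html=" + toHtmlName(name));
        }
    }
}
```

All three names map to the same typename.html, which is why only the plain .xml file should be processed.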
-
processGeneralPages
Adds entries for the top-level general documentation pages.
Each remaining page is indexed per top-level <section>, using the section's full text content for keyword extraction so page-internal headings are fully discoverable. Generic structural section names (see GENERIC_SECTION_NAMES) are disambiguated by prefixing the page's own title.
Parameters:
xdocsDir - the xdocs root directory
-
buildMainEntry
private static SearchIndexEntry buildMainEntry(Document doc, File xmlFile, String category, String type, String baseUrl)
Builds the main search entry representing an entire check/filter document.
Parameters:
doc - the parsed XDoc document
xmlFile - the source file
category - category label for this file's entry
type - document type ("Check", "Filter", etc.)
baseUrl - the page url without anchor
Returns:
an entry representing the document
-
buildGeneralPageEntries
private static List<SearchIndexEntry> buildGeneralPageEntries(File xmlFile) throws ParserConfigurationException, SAXException, IOException
Builds one search entry per top-level <section> in a general documentation page, using each section's full text for keyword extraction so that page-internal content is fully discoverable.
Generic structural section names (see GENERIC_SECTION_NAMES) are disambiguated as "<page title>: <section name>" to avoid collisions across pages (e.g. "Eclipse IDE: Debug" vs "IntelliJ IDE: Debug").
Parameters:
xmlFile - the XDoc source file to parse
Returns:
list of entries, one per top-level section found
Throws:
ParserConfigurationException - on XML parser setup failure
SAXException - on XML parse error
IOException - on file read failure
-
extractExampleEntries
private static List<SearchIndexEntry> extractExampleEntries(Document doc, String baseUrl, String category)
Extracts per-example search entries from a check/filter document.
Both -config and -code example paragraphs are indexed so users can find both the configuration snippet and the corresponding Java code example independently in search results.
Titles use the pattern "<CheckName>: Example1 [config]" and "<CheckName>: Example1 [code]" to make the type immediately visible in search result listings without needing to open the page.
Confirmed XDoc template structure for the Examples subsection:
    <p id="Example1-config">To configure the check...</p>
    <macro name="example"><param name="type" value="config"/></macro>
    <p id="Example1-code">Example:</p>
    <macro name="example"><param name="type" value="code"/></macro>
Parameters:
doc - the parsed XDoc document
baseUrl - the page url without anchor
category - category label
Returns:
list of per-example entries (both config and code); empty if none found
-
buildExampleEntry
private static SearchIndexEntry buildExampleEntry(Element paragraph, String checkName, String baseUrl, String category)
Builds a single example entry from a paragraph element.
Parameters:
paragraph - the paragraph element containing the example
checkName - the name of the check
baseUrl - the base URL for the page
category - the category label
Returns:
a SearchIndexEntry if the paragraph matches the example pattern, null otherwise
-
extractPropertyEntries
private static List<SearchIndexEntry> extractPropertyEntries(Document doc, String baseUrl, String category)
Extracts per-property search entries from a check/filter document.
Each row of the Properties table is indexed under the title "<CheckName>: <propertyName>" and linked to the property's own anchor on the page.
Parameters:
doc - the parsed XDoc document
baseUrl - the page url without anchor
category - category label
Returns:
list of per-property entries; empty if none found
-
extractPropertiesFromRows
private static void extractPropertiesFromRows(Element propertiesSubsection, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)
Extracts property entries from table rows and adds them to the list.
Parameters:
propertiesSubsection - the properties subsection element
checkName - the check name
baseUrl - the page url without anchor
category - category label
propertyEntries - the list to add entries to
-
processPropertyRow
private static void processPropertyRow(NodeList cells, String checkName, String baseUrl, String category, List<SearchIndexEntry> propertyEntries)
Processes a single property row and adds an entry if valid.
Parameters:
cells - the table cells
checkName - the check name
baseUrl - the page url without anchor
category - category label
propertyEntries - the list to add entries to
-
addIfNew
Adds an entry to the output list only if its URL has not been seen before. This is a secondary guard that catches any duplicates that slip through the primary filter (only processing plain .xml files), e.g. if a check has the same example paragraph id repeated across two sections.
Parameters:
entry - the entry to conditionally add
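A URL-keyed guard of this kind is commonly built on the boolean returned by Set.add. The sketch below assumes that approach; the nested Entry class is a hypothetical stand-in for SearchIndexEntry, whose fields are not documented on this page.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class DedupSketch {
    /** Hypothetical stand-in for SearchIndexEntry; only the URL matters here. */
    static final class Entry {
        final String title;
        final String url;

        Entry(String title, String url) {
            this.title = title;
            this.url = url;
        }
    }

    private final List<Entry> entries = new ArrayList<>();
    private final Set<String> seenUrls = new HashSet<>();

    /** Set.add returns false for a URL already present, so repeats are dropped. */
    void addIfNew(Entry entry) {
        if (seenUrls.add(entry.url)) {
            entries.add(entry);
        }
    }

    List<Entry> entries() {
        return entries;
    }

    public static void main(String[] args) {
        DedupSketch index = new DedupSketch();
        index.addIfNew(new Entry("first", "checks/naming/typename.html#Example1-config"));
        index.addIfNew(new Entry("repeat", "checks/naming/typename.html#Example1-config"));
        System.out.println(index.entries().size());
    }
}
```

The first entry for a given URL wins; later entries with the same URL are ignored.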
-
findSubsectionByPrefix
Finds a subsection within a section whose lowercased name contains the given fragment (e.g. "examples" or "propert" to match "Properties").
Parameters:
section - the section to search
fragment - lowercase fragment to match against the subsection name
Returns:
the matching subsection element, or null if not found
-
parseXml
private static Document parseXml(File xmlFile) throws ParserConfigurationException, SAXException, IOException
Parses the XML file into a Document with external entity resolution disabled for security.
Parameters:
xmlFile - the XDoc source file
Returns:
the parsed Document
Throws:
ParserConfigurationException - on XML parser setup failure
SAXException - on XML parse error
IOException - on file read failure
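One common way to disable external entity resolution with the JDK's DOM parser is shown below. This is a sketch under assumptions: the two feature URIs are the standard SAX feature names, presumed to be what EXTERNAL_GENERAL_ENTITIES and EXTERNAL_PARAMETER_ENTITIES hold, and the sketch parses a string rather than a File to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

final class SafeXmlParseSketch {
    // Standard SAX feature names; assumed values of the class's constants.
    static final String EXTERNAL_GENERAL_ENTITIES =
            "http://xml.org/sax/features/external-general-entities";
    static final String EXTERNAL_PARAMETER_ENTITIES =
            "http://xml.org/sax/features/external-parameter-entities";

    static Document parse(String xml)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse to resolve external general and parameter entities.
        factory.setFeature(EXTERNAL_GENERAL_ENTITIES, false);
        factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, false);
        factory.setExpandEntityReferences(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<document><body><section name=\"Overview\"/></body></document>");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```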
-
extractTitle
Extracts the document title from the <title> element, falling back to the first non-empty, non-"Content" section name, and finally to a capitalised version of the file name.
Parameters:
doc - the document
xmlFile - the source file
sections - the list of sections
Returns:
the title string, never empty
-
extractAggregateDescription
Aggregates description from sections, taking the first non-empty Description subsection found across all sections in the document.
Parameters:
sections - list of sections
Returns:
description string, possibly empty
-
extractAggregateKeywords
Aggregates keywords from sections using all section text so that the main check entry is discoverable by any term in the document.
Parameters:
title - the document title
sections - list of sections
Returns:
keywords string
-
extractDescription
Extracts the first sentence of the Description subsection. Returns an empty string if no Description subsection is found.
Parameters:
section - the <section> element to search
Returns:
first sentence of the description, or empty string
-
derivePageTitle
Derives a fallback page title from the document's <title> element or, failing that, from the filename.
Parameters:
doc - the parsed document
xmlFile - the source file
Returns:
a non-empty title string
-
disambiguateTitle
Disambiguates a section title when it is a generic, structurally repeated header (see GENERIC_SECTION_NAMES). Non-generic section names are returned unchanged.
Parameters:
sectionName - the raw section name
pageTitle - the owning page's own title
Returns:
either sectionName unchanged, or "<pageTitle>: <sectionName>" if generic
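The rule above amounts to a set lookup followed by a prefix. In this sketch the set holds only the three generic names quoted in the class description; the real GENERIC_SECTION_NAMES contents and the exact TITLE_SEPARATOR value are assumptions.

```java
import java.util.Set;

final class TitleSketch {
    // Illustrative subset only; the real set is not listed on this page.
    static final Set<String> GENERIC_SECTION_NAMES = Set.of("Overview", "Debug", "Contributing");
    // Assumed value, inferred from the "Eclipse IDE: Debug" example.
    static final String TITLE_SEPARATOR = ": ";

    static String disambiguateTitle(String sectionName, String pageTitle) {
        if (GENERIC_SECTION_NAMES.contains(sectionName)) {
            return pageTitle + TITLE_SEPARATOR + sectionName;
        }
        return sectionName;
    }

    public static void main(String[] args) {
        System.out.println(disambiguateTitle("Debug", "Eclipse IDE"));
        System.out.println(disambiguateTitle("Debug", "IntelliJ IDE"));
        System.out.println(disambiguateTitle("Properties Loading", "Configuration"));
    }
}
```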
-
doxiaAnchorFor
Converts a Doxia <section name="..."> value into the anchor id Doxia generates for it in the rendered HTML by replacing runs of whitespace with single underscores.
Parameters:
sectionName - the raw name attribute value
Returns:
the anchor id Doxia would render for this section name
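The whitespace-to-underscore conversion can be sketched with a single regular expression replacement. Trimming the input first is an assumption added here for illustration; the description above only mentions replacing whitespace runs.

```java
import java.util.regex.Pattern;

final class AnchorSketch {
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Collapses each run of whitespace into one underscore (trim is an assumption). */
    static String doxiaAnchorFor(String sectionName) {
        return WHITESPACE.matcher(sectionName.trim()).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(doxiaAnchorFor("Running  from\tcommand line"));
    }
}
```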
-
extractFirstSentenceOrTruncated
Returns the first sentence of the given text (up to and including the first period), or the text truncated to MAX_DESCRIPTION_LENGTH with an ellipsis if no period is found within range.
Parameters:
text - the source text, already whitespace-normalised
Returns:
first sentence or truncated text
-
truncate
Truncates text to the given max length, appending an ellipsis if truncation occurred.
Parameters:
text - the text to truncate
maxLength - maximum length before truncation
Returns:
original text if short enough, otherwise truncated with ellipsis
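The first-sentence and truncation rules described for extractFirstSentenceOrTruncated and truncate can be sketched together. The limit of 200 and the three-dot ellipsis are hypothetical values; the actual MAX_DESCRIPTION_LENGTH and ELLIPSIS constants are not given on this page.

```java
final class DescriptionSketch {
    static final int MAX_DESCRIPTION_LENGTH = 200; // hypothetical value
    static final String ELLIPSIS = "...";          // hypothetical value

    /** Returns text unchanged if it fits, otherwise the first maxLength chars plus an ellipsis. */
    static String truncate(String text, int maxLength) {
        if (text.length() <= maxLength) {
            return text;
        }
        return text.substring(0, maxLength) + ELLIPSIS;
    }

    /** Returns the text up to and including the first period if it falls within the limit. */
    static String firstSentenceOrTruncated(String text) {
        int period = text.indexOf('.');
        if (period >= 0 && period < MAX_DESCRIPTION_LENGTH) {
            return text.substring(0, period + 1);
        }
        return truncate(text, MAX_DESCRIPTION_LENGTH);
    }

    public static void main(String[] args) {
        System.out.println(firstSentenceOrTruncated("Checks the location of annotations. More detail follows."));
        System.out.println(truncate("abcdef", 3));
    }
}
```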
-
buildUrl
Builds the root-relative URL for an XDoc file, without any anchor. Always uses forward slashes regardless of OS.
Parameters:
xmlFile - the source XDoc file
xdocsDir - the xdocs root directory
Returns:
root-relative URL string with no anchor
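A possible shape for this URL building is to relativize the file against the xdocs root, normalise separators, and swap the extension. The sketch uses java.nio.file.Path rather than File and an assumed extension pattern, so it illustrates the idea and is not the class's actual code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class UrlSketch {
    /** Relativizes xmlFile against xdocsDir, forces forward slashes, and appends ".html". */
    static String buildUrl(Path xmlFile, Path xdocsDir) {
        String relative = xdocsDir.relativize(xmlFile).toString().replace('\\', '/');
        // Assumed extension pattern covering .xml, .xml.vm and .xml.template.
        return relative.replaceFirst("\\.xml(\\.vm|\\.template)?$", ".html");
    }

    public static void main(String[] args) {
        Path root = Paths.get("src/site/xdoc");
        System.out.println(buildUrl(Paths.get("src/site/xdoc/checks/naming/typename.xml"), root));
    }
}
```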
-
resolvePageUrl
Resolves the correct URL for a general page file. For config_<category>.xml files that redirect to check category pages, maps to checks/<category>/index.html instead of the file path.
Parameters:
xmlFile - the source XDoc file
xdocsDir - the xdocs root directory
Returns:
the resolved URL
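The redirect mapping can be sketched as a pattern match on the file name. The pattern and the set of redirecting categories are hypothetical; how the real class decides which config_*.xml files redirect is not documented here, and pages such as config_system_properties are indexed as ordinary general pages.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageUrlSketch {
    // Assumed pattern: group 1 captures the category name.
    static final Pattern CONFIG_CATEGORY = Pattern.compile("^config_([a-z]+)\\.xml$");
    // Hypothetical list of categories whose config page is a redirect stub.
    static final Set<String> REDIRECTING_CATEGORIES = Set.of("metrics", "naming");

    static String resolvePageUrl(String fileName) {
        Matcher matcher = CONFIG_CATEGORY.matcher(fileName);
        if (matcher.matches() && REDIRECTING_CATEGORIES.contains(matcher.group(1))) {
            return "checks/" + matcher.group(1) + "/index.html";
        }
        // Any other page keeps its own path with the extension swapped.
        return fileName.replaceFirst("\\.xml$", ".html");
    }

    public static void main(String[] args) {
        System.out.println(resolvePageUrl("config_metrics.xml"));
        System.out.println(resolvePageUrl("config_system_properties.xml"));
    }
}
```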
-
extractKeywordsFromText
Extracts keywords from free-form text by splitting on non-word characters and filtering short and stop words.
Parameters:
text - input text
Returns:
comma-separated keyword string (up to MAX_KEYWORDS words)
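The split-and-filter step might look like the sketch below. The stop-word list, the length and count limits, and the lowercasing and de-duplication are all assumptions chosen for illustration; only the overall approach comes from the description above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

final class KeywordSketch {
    static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^A-Za-z0-9]+");
    static final Set<String> STOP_WORDS = Set.of("the", "and", "for", "that", "with"); // illustrative
    static final int MIN_WORD_LENGTH = 3; // hypothetical value
    static final int MAX_KEYWORDS = 20;   // hypothetical value

    static String extractKeywords(String text) {
        // LinkedHashSet keeps first-seen order and drops repeated words.
        Set<String> keywords = new LinkedHashSet<>();
        for (String word : NON_ALPHANUMERIC.split(text.toLowerCase(Locale.ROOT))) {
            if (keywords.size() >= MAX_KEYWORDS) {
                break;
            }
            if (word.length() >= MIN_WORD_LENGTH && !STOP_WORDS.contains(word)) {
                keywords.add(word);
            }
        }
        return String.join(",", keywords);
    }

    public static void main(String[] args) {
        System.out.println(extractKeywords("Checks the location of annotations, and the annotations style."));
    }
}
```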
-
writeJson
Writes all index entries to the output file.
Parameters:
indexEntries - the list of entries to serialise
outputFilePath - the full path to the output file
Throws:
IOException - on file write failure
-
capitalise
Capitalises the first character of a string.
Parameters:
input - the string to capitalise
Returns:
string with first character uppercased, or input unchanged if empty
-