sphinx_typesense.indexer

Content extraction and Typesense indexing. This module handles extracting searchable content from Sphinx HTML output and indexing it into Typesense collections using a DocSearch-compatible hierarchical schema.

Module Contents

Backward compatibility module for TypesenseIndexer.

This module maintains backward compatibility by re-exporting TypesenseBackend as TypesenseIndexer and exposing the schema and weight constants.

Deprecated: Use sphinx_typesense.backends.typesense.TypesenseBackend instead.

Example

New code should use:

from sphinx_typesense.backends.typesense import TypesenseBackend

backend = TypesenseBackend(app)
count = backend.index_all()

Legacy code using TypesenseIndexer continues to work:

from sphinx_typesense.indexer import TypesenseIndexer

indexer = TypesenseIndexer(app)
count = indexer.index_all()

class sphinx_typesense.indexer.TypesenseBackend

Bases: SearchBackend

Typesense search backend implementation.

Provides server-based search using Typesense. Indexes documentation content at build time and provides frontend assets for DocSearch UI.

This class handles the complete indexing pipeline:
  1. Initialize Typesense client from Sphinx config

  2. Ensure collection exists with correct schema

  3. Parse HTML files and extract hierarchical content

  4. Create and bulk import documents

name

Backend identifier (“typesense”).

Type:

str

collection_name

Name of the Typesense collection.

name: str = 'typesense'

__init__(app)

Initialize the Typesense backend.

Parameters:

app (Sphinx) – The Sphinx application instance.

property client: Client

Get or create the Typesense client.

Returns:

Configured Typesense client instance.

index_all()

Index all HTML files from build output.

Performs connection validation before indexing. If the Typesense server is unavailable, returns 0 without failing the build.

Returns:

Number of documents indexed, or 0 if server unavailable.

Return type:

int
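
The graceful degradation described for index_all() can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: FakeBackend, import_documents, and index_all_safely are hypothetical names, and only the is_available() check and the 0 return value come from the documented behavior.

```python
import logging

logger = logging.getLogger(__name__)


class FakeBackend:
    """Hypothetical stand-in for a backend whose server may be down."""

    def __init__(self, available, documents):
        self._available = available
        self._documents = documents

    def is_available(self):
        return self._available

    def import_documents(self):
        # Hypothetical helper: pretend to import and report the count.
        return len(self._documents)


def index_all_safely(backend):
    """Return the number of indexed documents, or 0 if the server is down."""
    if not backend.is_available():
        # Warn instead of raising so the documentation build still succeeds.
        logger.warning("Typesense unavailable; skipping indexing")
        return 0
    return backend.import_documents()
```

Returning 0 rather than raising keeps a search outage from turning into a failed documentation build.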

get_js_files()

Return Typesense DocSearch JavaScript files.

Returns:

List of (filename, attributes) tuples for app.add_js_file().

Return type:

list[tuple[str, dict[str, str | int]]]

get_css_files()

Return Typesense DocSearch CSS files.

Returns:

List of CSS filenames for app.add_css_file().

Return type:

list[str]
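
The return values of get_js_files() and get_css_files() are shaped to be passed to app.add_js_file() and app.add_css_file(). The sketch below shows one way a caller might consume them. RecordingApp is a hypothetical stub standing in for the Sphinx application, and the file names and the priority attribute are illustrative assumptions, not values taken from the library.

```python
class RecordingApp:
    """Hypothetical stub that records asset registrations like a Sphinx app would."""

    def __init__(self):
        self.js = []
        self.css = []

    def add_js_file(self, filename, **attributes):
        self.js.append((filename, attributes))

    def add_css_file(self, filename, **attributes):
        self.css.append(filename)


def register_assets(app, js_files, css_files):
    """Register (filename, attributes) JS tuples and plain CSS filenames."""
    for filename, attributes in js_files:
        # Each attribute dict is expanded into keyword arguments.
        app.add_js_file(filename, **attributes)
    for filename in css_files:
        app.add_css_file(filename)


# Illustrative values only; the real backend supplies its own file names.
app = RecordingApp()
register_assets(
    app,
    js_files=[("typesense-docsearch.js", {"priority": 500})],
    css_files=["typesense-docsearch.css"],
)
```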

get_config_script()

Return inline JavaScript configuration for DocSearch.

Returns:

JavaScript code setting window.TYPESENSE_CONFIG.

Return type:

str
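
A script of this kind could be produced by serializing settings to JSON and assigning the result to window.TYPESENSE_CONFIG. The following is a minimal sketch, assuming hypothetical setting names (collectionName, host); the keys the real method emits are not documented here.

```python
import json


def build_config_script(settings):
    """Return inline JavaScript that assigns settings to window.TYPESENSE_CONFIG."""
    payload = json.dumps(settings, sort_keys=True)
    # Escape "</" so a value cannot close the surrounding <script> element early.
    payload = payload.replace("</", "<\\/")
    return f"window.TYPESENSE_CONFIG = {payload};"


# Hypothetical settings, for illustration only.
script = build_config_script({"collectionName": "docs", "host": "localhost"})
```

Serializing with json.dumps avoids hand-built string concatenation, so quotes and special characters in configuration values are escaped consistently.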

is_available()

Check if Typesense server is available.

Returns:

True if server is reachable and authenticated, False otherwise.

Return type:

bool

sphinx_typesense.indexer.index_documents(app, exception)

Sphinx event handler to index documents after build.

This function is called by Sphinx after the build completes. It creates a TypesenseBackend and indexes all HTML files if no exception occurred.

Implements graceful degradation: if Typesense is unavailable, the build completes successfully with a warning. Search will not be available until the server is restored and docs are rebuilt.

Parameters:
  • app (Sphinx) – The Sphinx application instance.

  • exception (Exception | None) – Exception raised during build, if any.

TypesenseIndexer Usage

The main indexer class handles the complete indexing pipeline:

  1. Initialize Typesense client from Sphinx config

  2. Ensure collection exists with correct schema

  3. Parse HTML files and extract hierarchical content

  4. Create and bulk import documents

Usage Example:

from sphinx_typesense.indexer import TypesenseIndexer

# In a Sphinx event handler or build script
indexer = TypesenseIndexer(app)
count = indexer.index_all()
print(f"Indexed {count} documents")

Constants Reference

Collection Schema

The DOCS_SCHEMA constant defines the Typesense collection schema for documentation. It follows the DocSearch schema for frontend compatibility. Fields include:

  • hierarchy.lvl0 to hierarchy.lvl3: Hierarchical heading levels (faceted)

  • content: Paragraph/list text

  • url: Full URL with anchor

  • url_without_anchor: URL without fragment

  • anchor: Fragment identifier

  • type: Document type (lvl0, lvl1, content, etc.)

  • version: Documentation version (faceted)

  • language: Content language (faceted)

  • weight: Search ranking weight

  • item_priority: Default sorting priority
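
For orientation, a Typesense collection schema covering the fields above might look like the sketch below. The field names and facet flags follow the list; the field types, the optional flags, and the use of item_priority as default_sorting_field are assumptions and may differ from the actual DOCS_SCHEMA constant. build_docs_schema is a hypothetical helper.

```python
def build_docs_schema(collection_name):
    """Build a DocSearch-style schema dict (types and optional flags are assumed)."""
    fields = [
        {"name": f"hierarchy.lvl{level}", "type": "string", "facet": True, "optional": True}
        for level in range(4)
    ]
    fields += [
        {"name": "content", "type": "string", "optional": True},
        {"name": "url", "type": "string"},
        {"name": "url_without_anchor", "type": "string"},
        {"name": "anchor", "type": "string", "optional": True},
        {"name": "type", "type": "string"},
        {"name": "version", "type": "string", "facet": True},
        {"name": "language", "type": "string", "facet": True},
        {"name": "weight", "type": "int32"},
        {"name": "item_priority", "type": "int32"},
    ]
    return {
        "name": collection_name,
        "fields": fields,
        # Assumption: results are sorted by item_priority when no sort is given.
        "default_sorting_field": "item_priority",
    }


schema = build_docs_schema("docs")  # "docs" is a placeholder collection name
```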

Document Weights

The DOC_TYPE_WEIGHTS and DOC_TYPE_PRIORITIES constants define weight values for search ranking by document type:

  • lvl0: 100 (page titles)

  • lvl1: 90 (h2 headings)

  • lvl2: 80 (h3 headings)

  • lvl3: 70 (h4 headings)

  • content: 50 (paragraphs and list items)
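
To illustrate the effect of these values, the sketch below orders a few records by weight. The WEIGHTS mapping mirrors the list above; rank_by_weight is a hypothetical helper, and in practice the ordering would be applied by Typesense at query time rather than in Python.

```python
# Values copied from the list above.
WEIGHTS = {"lvl0": 100, "lvl1": 90, "lvl2": 80, "lvl3": 70, "content": 50}


def rank_by_weight(documents):
    """Order documents so that higher-weighted types come first."""
    return sorted(documents, key=lambda doc: WEIGHTS[doc["type"]], reverse=True)


ranked = rank_by_weight(
    [
        {"type": "content", "text": "Paragraph text"},
        {"type": "lvl2", "text": "Subsection"},
        {"type": "lvl0", "text": "Page Title"},
    ]
)
```

With these weights, a page title that matches a query outranks a paragraph that matches the same query.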

Indexing Process

The indexing process follows these steps:

  1. Collection Setup: Create the Typesense collection if it does not exist, or drop and recreate it if typesense_drop_existing is True.

  2. HTML Parsing: Iterate through all .html files in the build output directory.

  3. Content Extraction: For each HTML file:

    • Find the main content element using theme-specific selectors

    • Extract hierarchical structure (h1 > h2 > h3 > h4)

    • Extract content from paragraphs and list items

    • Generate unique document IDs using SHA-256 hashes

  4. Bulk Import: Import all documents to Typesense using the upsert action.
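
The content extraction step can be sketched with the standard library's html.parser. This is a simplified illustration, not the extension's parser: it skips the theme-specific selection of the main content element, handles nested elements only loosely, and HierarchyParser and extract_records are hypothetical names.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 0, "h2": 1, "h3": 2, "h4": 3}
CONTENT_TAGS = {"p", "li"}


class HierarchyParser(HTMLParser):
    """Collect (type, hierarchy, text) records from headings, paragraphs, and list items."""

    def __init__(self):
        super().__init__()
        self.hierarchy = [None, None, None, None]
        self.records = []
        self._tag = None  # tag whose text is currently being captured
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and (tag in HEADING_LEVELS or tag in CONTENT_TAGS):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._text).split())
        self._tag = None
        if not text:
            return
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            self.hierarchy[level] = text
            # A new heading invalidates every deeper level.
            for deeper in range(level + 1, 4):
                self.hierarchy[deeper] = None
            doc_type = f"lvl{level}"
        else:
            doc_type = "content"
        self.records.append((doc_type, tuple(self.hierarchy), text))


def extract_records(html):
    """Parse an HTML string and return the collected records in document order."""
    parser = HierarchyParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

Each record carries the heading path in effect at that point in the page, which is what lets a paragraph be displayed under its page title and section in search results.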

Document Structure

Each indexed document has the following structure:

{
    "id": "abc123...",                    # SHA256 hash
    "hierarchy.lvl0": "Page Title",       # h1
    "hierarchy.lvl1": "Section",          # h2
    "hierarchy.lvl2": "Subsection",       # h3
    "hierarchy.lvl3": "Sub-subsection",   # h4
    "content": "Paragraph text...",       # p, li
    "url": "page.html#anchor",
    "url_without_anchor": "page.html",
    "anchor": "anchor",
    "type": "content",                    # lvl0, lvl1, lvl2, lvl3, content
    "version": "1.0",
    "language": "en",
    "weight": 50,
    "item_priority": 50,
}
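
A document of this shape could be assembled as in the sketch below. make_document is a hypothetical helper. The inputs to the SHA-256 hash and the choice to set item_priority equal to weight are assumptions based on the example above, not confirmed details of the implementation.

```python
import hashlib

# Weights as listed in the "Document Weights" section.
WEIGHTS = {"lvl0": 100, "lvl1": 90, "lvl2": 80, "lvl3": 70, "content": 50}


def make_document(page, anchor, doc_type, hierarchy, content="",
                  version="1.0", language="en"):
    """Build one search document; hierarchy is a sequence of up to four titles."""
    url = f"{page}#{anchor}" if anchor else page
    # Assumption: the ID hashes the URL, type, and text; the real inputs may differ.
    digest = hashlib.sha256(f"{url}|{doc_type}|{content}".encode("utf-8")).hexdigest()
    doc = {
        "id": digest,
        "content": content,
        "url": url,
        "url_without_anchor": page,
        "anchor": anchor,
        "type": doc_type,
        "version": version,
        "language": language,
        "weight": WEIGHTS[doc_type],
        # Assumption: priority mirrors the weight, as in the example above.
        "item_priority": WEIGHTS[doc_type],
    }
    # Only include hierarchy levels that are actually set.
    for level, title in enumerate(hierarchy):
        if title:
            doc[f"hierarchy.lvl{level}"] = title
    return doc


doc = make_document(
    "page.html", "anchor", "content",
    ["Page Title", "Section", None, None], "Paragraph text...",
)
```

Deriving the ID from the document's own fields makes it deterministic, so re-importing with the upsert action updates existing documents instead of creating duplicates.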