sphinx_typesense.indexer

Content extraction and Typesense indexing. This module handles extracting searchable content from Sphinx HTML output and indexing it into Typesense collections using a DocSearch-compatible hierarchical schema.

Module Contents

Backward compatibility module for TypesenseIndexer.

This module maintains backward compatibility by re-exporting TypesenseBackend as TypesenseIndexer and exposing the schema and weight constants.

Deprecated: Use sphinx_typesense.backends.typesense.TypesenseBackend instead.

Example

New code should use:

from sphinx_typesense.backends.typesense import TypesenseBackend

backend = TypesenseBackend(app)
count = backend.index_all()

Legacy code using TypesenseIndexer continues to work:

from sphinx_typesense.indexer import TypesenseIndexer

indexer = TypesenseIndexer(app)
count = indexer.index_all()

class sphinx_typesense.indexer.TypesenseBackend

Bases: SearchBackend

Typesense search backend implementation.

Provides server-based search using Typesense. Indexes documentation content at build time and provides frontend assets for DocSearch UI.

This class handles the complete indexing pipeline:

1. Initialize Typesense client from Sphinx config
2. Ensure collection exists with correct schema
3. Parse HTML files and extract hierarchical content
4. Create and bulk import documents

name

Backend identifier (“typesense”).

Type: str

collection_name: Name of the Typesense collection.

name: str = 'typesense'

__init__(app)

Initialize the Typesense backend.

Parameters: app (Sphinx) – The Sphinx application instance.

property client: Client

Get or create the Typesense client.

Returns: Configured Typesense client instance.

index_all()

Index all HTML files from build output.

Performs connection validation before indexing. If the Typesense server is unavailable, returns 0 without failing the build.

Returns: Number of documents indexed, or 0 if server unavailable.
Return type: int

get_js_files()

Return Typesense DocSearch JavaScript files.

Returns: List of (filename, attributes) tuples for app.add_js_file().
Return type: list[tuple[str, dict[str, str | int]]]

get_css_files()

Return Typesense DocSearch CSS files.

Returns: List of CSS filenames for app.add_css_file().
Return type: list[str]

get_config_script()

Return inline JavaScript configuration for DocSearch.

Returns: JavaScript code setting window.TYPESENSE_CONFIG.
Return type: str
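
As an illustration of what such a script string might look like, the sketch below serializes a settings dict to JSON and assigns it to window.TYPESENSE_CONFIG. The helper name build_config_script and the keys collectionName and port are hypothetical; the keys the real backend emits are not documented here.

```python
import json


def build_config_script(config):
    """Return an inline JS statement assigning `config` to window.TYPESENSE_CONFIG.

    Hypothetical helper: the real get_config_script() may emit different keys
    and formatting.
    """
    # Escape "</" so a string value cannot close an enclosing <script> tag.
    payload = json.dumps(config, sort_keys=True).replace("</", "<\\/")
    return f"window.TYPESENSE_CONFIG = {payload};"


# Assumed example settings; 8108 is the default Typesense port.
script = build_config_script({"collectionName": "docs", "port": 8108})
```

Serializing with json.dumps rather than string formatting keeps quoting and escaping of values correct in the generated JavaScript.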

is_available()

Check if Typesense server is available.

Returns: True if server is reachable and authenticated, False otherwise.
Return type: bool

sphinx_typesense.indexer.index_documents(app, exception)

Sphinx event handler to index documents after build.

This function is called by Sphinx after the build completes. It creates a TypesenseBackend and indexes all HTML files if no exception occurred.

Implements graceful degradation: if Typesense is unavailable, the build completes successfully with a warning. Search will not be available until the server is restored and docs are rebuilt.

Parameters:

app (Sphinx) – The Sphinx application instance.
exception (Exception | None) – Exception raised during build, if any.
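
The control flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the module's source: FakeBackend is a hypothetical stand-in for TypesenseBackend, the logger name is made up, and the sketch returns the document count only so the behavior is easy to check.

```python
import logging

logger = logging.getLogger("sphinx_typesense.sketch")  # hypothetical logger name


class FakeBackend:
    """Hypothetical stand-in for TypesenseBackend, for illustration only."""

    def __init__(self, app, available=True):
        self.app = app
        self._available = available

    def is_available(self):
        return self._available

    def index_all(self):
        # Mirrors the documented contract: 0 when the server is unreachable.
        if not self.is_available():
            return 0
        return 3  # pretend three documents were indexed


def index_documents_sketch(app, exception, backend_factory=FakeBackend):
    """Index after a build; skip failed builds and never fail the build."""
    if exception is not None:
        # The build itself failed, so there is no reliable output to index.
        return 0
    backend = backend_factory(app)
    count = backend.index_all()
    if count == 0:
        logger.warning("Typesense unavailable or nothing indexed; search is disabled")
    return count
```

In a real extension, a handler with this (app, exception) signature is registered from setup(app) with app.connect("build-finished", handler), which is the Sphinx event fired once the build completes.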

TypesenseIndexer Usage

The main indexer class handles the complete indexing pipeline:

1. Initialize Typesense client from Sphinx config
2. Ensure collection exists with correct schema
3. Parse HTML files and extract hierarchical content
4. Create and bulk import documents

Usage Example:

from sphinx_typesense.indexer import TypesenseIndexer

# In a Sphinx event handler or build script
indexer = TypesenseIndexer(app)
count = indexer.index_all()
print(f"Indexed {count} documents")

Constants Reference

Collection Schema

The DOCS_SCHEMA constant defines the Typesense collection schema for documentation. It follows the DocSearch schema for frontend compatibility. Fields include:

hierarchy.lvl0-3: Hierarchical heading levels (faceted)
content: Paragraph/list text
url: Full URL with anchor
url_without_anchor: URL without fragment
anchor: Fragment identifier
type: Document type (lvl0, lvl1, content, etc.)
version: Documentation version (faceted)
language: Content language (faceted)
weight: Search ranking weight
item_priority: Default sorting priority
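
For orientation, a schema with these fields could be written in Typesense's collection-schema format roughly as below. This is a hypothetical reconstruction: the field types, the optional flags, the collection name "docs", and the use of item_priority as default_sorting_field are assumptions, and the real DOCS_SCHEMA may differ.

```python
def build_schema_sketch(collection_name="docs"):
    """Return a DocSearch-style Typesense schema dict (illustrative only)."""
    # hierarchy.lvl0 through hierarchy.lvl3 are faceted heading levels.
    fields = [
        {"name": f"hierarchy.lvl{i}", "type": "string", "facet": True, "optional": True}
        for i in range(4)
    ]
    fields += [
        {"name": "content", "type": "string", "optional": True},
        {"name": "url", "type": "string"},
        {"name": "url_without_anchor", "type": "string"},
        {"name": "anchor", "type": "string", "optional": True},
        {"name": "type", "type": "string"},
        {"name": "version", "type": "string", "facet": True, "optional": True},
        {"name": "language", "type": "string", "facet": True, "optional": True},
        {"name": "weight", "type": "int32"},
        {"name": "item_priority", "type": "int32"},
    ]
    return {
        "name": collection_name,
        "fields": fields,
        # Assumed: Typesense requires this to be a non-optional numeric field.
        "default_sorting_field": "item_priority",
    }


schema = build_schema_sketch()
faceted = sorted(f["name"] for f in schema["fields"] if f.get("facet"))
```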

Document Weights

The DOC_TYPE_WEIGHTS and DOC_TYPE_PRIORITIES constants define weight values for search ranking by document type:

lvl0: 100 (page titles)
lvl1: 90 (h2 headings)
lvl2: 80 (h3 headings)
lvl3: 70 (h4 headings)
content: 50 (paragraphs and list items)
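
To show the effect of these values, the sketch below copies the listed weights into a dict and orders a few records by them. The names DOC_TYPE_WEIGHTS_SKETCH and rank_by_weight are hypothetical, and the ordering here is done client-side purely for illustration; how the weight field is applied at query time is not specified in this reference.

```python
# Values copied from the list above; the dict name is hypothetical.
DOC_TYPE_WEIGHTS_SKETCH = {
    "lvl0": 100,
    "lvl1": 90,
    "lvl2": 80,
    "lvl3": 70,
    "content": 50,
}


def rank_by_weight(docs):
    """Return docs ordered from highest to lowest weight for their type.

    sorted() is stable, so records of the same type keep their input order.
    """
    return sorted(docs, key=lambda d: DOC_TYPE_WEIGHTS_SKETCH[d["type"]], reverse=True)


hits = [
    {"id": "a", "type": "content"},
    {"id": "b", "type": "lvl1"},
    {"id": "c", "type": "lvl0"},
]
ranked_ids = [d["id"] for d in rank_by_weight(hits)]
```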

Indexing Process

The indexing process follows these steps:

1. Collection Setup: Create the Typesense collection if it does not exist, or drop and recreate if typesense_drop_existing is True.
2. HTML Parsing: Iterate through all .html files in the build output directory.
3. Content Extraction: For each HTML file:
   - Find the main content element using theme-specific selectors
   - Extract hierarchical structure (h1 > h2 > h3 > h4)
   - Extract content from paragraphs and list items
   - Generate unique document IDs using SHA256 hashes
4. Bulk Import: Import all documents to Typesense using the upsert action.
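
The content-extraction step can be sketched with the standard library's html.parser. This is a simplified illustration, not the package's extractor: it skips the theme-specific content selection, ignores nested lists and unclosed tags, and the names HierarchyExtractor and extract_records are hypothetical.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 0, "h2": 1, "h3": 2, "h4": 3}
CONTENT_TAGS = {"p", "li"}


class HierarchyExtractor(HTMLParser):
    """Collect heading and content records with their heading hierarchy."""

    def __init__(self):
        super().__init__()
        self.hierarchy = [None, None, None, None]  # current lvl0..lvl3 text
        self.records = []
        self._tag = None  # tag currently being captured, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only at an outermost heading or content tag.
        if self._tag is None and (tag in HEADING_LEVELS or tag in CONTENT_TAGS):
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._text).split())  # collapse whitespace
        self._tag = None
        if not text:
            return
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            self.hierarchy[level] = text
            # A new heading resets every deeper heading level.
            for deeper in range(level + 1, 4):
                self.hierarchy[deeper] = None
            doc_type, content = f"lvl{level}", ""
        else:
            doc_type, content = "content", text
        self.records.append(
            {"type": doc_type, "content": content, "hierarchy": tuple(self.hierarchy)}
        )


def extract_records(html):
    """Parse an HTML string and return the extracted records in document order."""
    parser = HierarchyExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


sample = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Setup</h2><ul><li>Install <em>it</em></li></ul>"
    "<h2>Usage</h2><p>Run it.</p>"
)
records = extract_records(sample)
```

Each record carries the headings in effect at that point in the page, which is what lets a search result display its surrounding section titles.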

Document Structure

Each indexed document has the following structure:

{
    "id": "abc123...",                    # SHA256 hash
    "hierarchy.lvl0": "Page Title",       # h1
    "hierarchy.lvl1": "Section",          # h2
    "hierarchy.lvl2": "Subsection",       # h3
    "hierarchy.lvl3": "Sub-subsection",   # h4
    "content": "Paragraph text...",       # p, li
    "url": "page.html#anchor",
    "url_without_anchor": "page.html",
    "anchor": "anchor",
    "type": "content",                    # lvl0, lvl1, lvl2, lvl3, content
    "version": "1.0",
    "language": "en",
    "weight": 50,
    "item_priority": 50,
}
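
A document of this shape could be assembled as in the sketch below. The helper make_document is hypothetical, and the choice of which fields feed the SHA256 hash is an assumption; the inputs the real indexer hashes are not documented here.

```python
import hashlib


def make_document(url, hierarchy, doc_type, content="",
                  version="1.0", language="en", weight=50):
    """Build one document dict in the structure shown above (illustrative)."""
    # Split "page.html#anchor" into its page and fragment parts.
    url_without_anchor, _, anchor = url.partition("#")
    # Assumed ID inputs: URL, type, and content joined with a separator.
    key = "|".join([url, doc_type, content])
    doc = {
        "id": hashlib.sha256(key.encode("utf-8")).hexdigest(),
        "content": content,
        "url": url,
        "url_without_anchor": url_without_anchor,
        "anchor": anchor,
        "type": doc_type,
        "version": version,
        "language": language,
        "weight": weight,
        "item_priority": weight,  # assumed equal to weight, as in the example above
    }
    # Only include the hierarchy levels that are actually set.
    for level, title in enumerate(hierarchy):
        if title:
            doc[f"hierarchy.lvl{level}"] = title
    return doc


doc = make_document(
    "page.html#anchor", ["Page Title", "Section"], "content", "Paragraph text..."
)
```

Because the ID is derived from the record's content, re-indexing unchanged pages with the upsert action overwrites the existing documents instead of creating duplicates.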