sphinx_typesense.indexer
Content extraction and Typesense indexing. This module handles extracting searchable content from Sphinx HTML output and indexing it into Typesense collections using a DocSearch-compatible hierarchical schema.
Module Contents
Backward compatibility module for TypesenseIndexer.
This module maintains backward compatibility by re-exporting TypesenseBackend as TypesenseIndexer and exposing the schema and weight constants.
Deprecated: Use sphinx_typesense.backends.typesense.TypesenseBackend instead.
Example
New code should use:
from sphinx_typesense.backends.typesense import TypesenseBackend
backend = TypesenseBackend(app)
count = backend.index_all()
Legacy code using TypesenseIndexer continues to work:
from sphinx_typesense.indexer import TypesenseIndexer
indexer = TypesenseIndexer(app)
count = indexer.index_all()
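The re-export described above can be pictured as a plain module-level alias. The sketch below uses a stand-in class rather than the real TypesenseBackend, and the alias line is an assumption about how such a re-export is commonly written, not a copy of the module source.

```python
class TypesenseBackend:
    """Stand-in for sphinx_typesense.backends.typesense.TypesenseBackend."""

    def __init__(self, app):
        self.app = app

    def index_all(self):
        # The real method indexes HTML files; this stub indexes nothing.
        return 0


# Assumed form of the backward-compatible re-export: both names are one class.
TypesenseIndexer = TypesenseBackend

indexer = TypesenseIndexer(app=None)
```

Because the alias binds the same class object, an instance created through either name passes an isinstance check against the other.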
- class sphinx_typesense.indexer.TypesenseBackend
Bases: SearchBackend
Typesense search backend implementation.
Provides server-based search using Typesense. Indexes documentation content at build time and provides frontend assets for the DocSearch UI.
This class handles the complete indexing pipeline:
1. Initialize Typesense client from Sphinx config
2. Ensure collection exists with correct schema
3. Parse HTML files and extract hierarchical content
4. Create and bulk import documents
- collection_name
Name of the Typesense collection.
- __init__(app)
Initialize the Typesense backend.
- Parameters:
app (Sphinx) – The Sphinx application instance.
- property client: Client
Get or create the Typesense client.
- Returns:
Configured Typesense client instance.
- index_all()
Index all HTML files from build output.
Performs connection validation before indexing. If the Typesense server is unavailable, returns 0 without failing the build.
- Returns:
Number of documents indexed, or 0 if server unavailable.
- Return type:
- sphinx_typesense.indexer.index_documents(app, exception)
Sphinx event handler to index documents after build.
This function is called by Sphinx after the build completes. It creates a TypesenseBackend and indexes all HTML files if no exception occurred.
Implements graceful degradation: if Typesense is unavailable, the build completes successfully with a warning. Search will not be available until the server is restored and docs are rebuilt.
- Parameters:
app (Sphinx) – The Sphinx application instance.
exception (Exception | None) – Exception raised during build, if any.
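The handler behaviour described above can be sketched as follows. StubBackend, index_documents_sketch, and the backend_factory parameter are hypothetical names introduced for illustration; the real handler builds a TypesenseBackend, and returning the count here is only a convenience for the example.

```python
import logging

logger = logging.getLogger(__name__)


class StubBackend:
    """Hypothetical stand-in for TypesenseBackend; not the real class."""

    def __init__(self, app):
        self.app = app

    def index_all(self):
        # The documented method returns 0 when the server is unavailable.
        return 0


def index_documents_sketch(app, exception, backend_factory=StubBackend):
    """Index after a build, skipping the work when the build itself failed."""
    if exception is not None:
        # The build raised, so there is no reliable HTML output to index.
        return None
    count = backend_factory(app).index_all()
    if count == 0:
        # Graceful degradation: warn instead of failing the build.
        logger.warning("No documents indexed; search may be unavailable.")
    return count


skipped = index_documents_sketch(object(), RuntimeError("build failed"))
degraded = index_documents_sketch(object(), None)
```

A handler with the (app, exception) signature matches Sphinx's build-finished event, so registering it would look like app.connect("build-finished", index_documents) inside an extension's setup() function. Whether this extension wires it exactly that way is not confirmed by this page.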
TypesenseIndexer Usage
The main indexer class handles the complete indexing pipeline:
1. Initialize Typesense client from Sphinx config
2. Ensure collection exists with correct schema
3. Parse HTML files and extract hierarchical content
4. Create and bulk import documents
Usage Example:
from sphinx_typesense.indexer import TypesenseIndexer
# In a Sphinx event handler or build script
indexer = TypesenseIndexer(app)
count = indexer.index_all()
print(f"Indexed {count} documents")
Constants Reference
Collection Schema
The DOCS_SCHEMA constant defines the Typesense collection schema for
documentation. It follows the DocSearch schema for frontend compatibility.
Fields include:
- hierarchy.lvl0-3: Hierarchical heading levels (faceted)
- content: Paragraph/list text
- url: Full URL with anchor
- url_without_anchor: URL without fragment
- anchor: Fragment identifier
- type: Document type (lvl0, lvl1, content, etc.)
- version: Documentation version (faceted)
- language: Content language (faceted)
- weight: Search ranking weight
- item_priority: Default sorting priority
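To make the field list concrete, here is a minimal sketch of a Typesense collection schema carrying those fields. The field names and facet flags come from the list above; the field types, the optional flags, the use of item_priority as default_sorting_field, and the build_schema helper itself are assumptions, so the actual DOCS_SCHEMA may differ.

```python
def build_schema(collection_name):
    """Return a DocSearch-style schema dict (illustrative, not DOCS_SCHEMA itself)."""
    # hierarchy.lvl0 through hierarchy.lvl3 are faceted heading levels.
    fields = [
        {"name": f"hierarchy.lvl{n}", "type": "string", "facet": True, "optional": True}
        for n in range(4)
    ]
    fields += [
        {"name": "content", "type": "string", "optional": True},
        {"name": "url", "type": "string"},
        {"name": "url_without_anchor", "type": "string"},
        {"name": "anchor", "type": "string", "optional": True},
        {"name": "type", "type": "string"},
        {"name": "version", "type": "string", "facet": True},
        {"name": "language", "type": "string", "facet": True},
        {"name": "weight", "type": "int32"},
        {"name": "item_priority", "type": "int32"},
    ]
    return {
        "name": collection_name,
        "fields": fields,
        # Assumed: the "default sorting priority" field drives default ordering.
        "default_sorting_field": "item_priority",
    }


schema = build_schema("docs")  # "docs" is a placeholder collection name
faceted = {f["name"] for f in schema["fields"] if f.get("facet")}
```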
Document Weights
The DOC_TYPE_WEIGHTS and DOC_TYPE_PRIORITIES constants define
weight values for search ranking by document type:
- lvl0: 100 (page titles)
- lvl1: 90 (h2 headings)
- lvl2: 80 (h3 headings)
- lvl3: 70 (h4 headings)
- content: 50 (paragraphs and list items)
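The weight table above maps naturally onto a dictionary lookup. In this sketch the values are the ones listed, while the _SKETCH constant name, the weight_for helper, and its fallback to the content weight for unknown types are assumptions made for illustration.

```python
# Weights as listed above; the real constant is DOC_TYPE_WEIGHTS.
DOC_TYPE_WEIGHTS_SKETCH = {
    "lvl0": 100,
    "lvl1": 90,
    "lvl2": 80,
    "lvl3": 70,
    "content": 50,
}


def weight_for(doc_type):
    """Look up a ranking weight, falling back to the content weight (assumed)."""
    return DOC_TYPE_WEIGHTS_SKETCH.get(doc_type, DOC_TYPE_WEIGHTS_SKETCH["content"])


# Sorting by weight puts page titles first and body text last.
ranked = sorted(DOC_TYPE_WEIGHTS_SKETCH, key=weight_for, reverse=True)
```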
Indexing Process
The indexing process follows these steps:
1. Collection Setup: Create the Typesense collection if it does not exist, or drop and recreate if typesense_drop_existing is True.
2. HTML Parsing: Iterate through all .html files in the build output directory.
3. Content Extraction: For each HTML file:
   - Find the main content element using theme-specific selectors
   - Extract hierarchical structure (h1 > h2 > h3 > h4)
   - Extract content from paragraphs and list items
   - Generate unique document IDs using SHA256 hashes
4. Bulk Import: Import all documents to Typesense using the upsert action.
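The content extraction step above can be sketched with the standard library's html.parser. This is a simplified illustration, not the extension's extractor: the HierarchyExtractor class is hypothetical, it scans the whole page instead of locating a theme-specific main content element, and it assumes anchors come from id attributes on section or heading tags.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 0, "h2": 1, "h3": 2, "h4": 3}
CONTENT_TAGS = {"p", "li"}


class HierarchyExtractor(HTMLParser):
    """Collect heading and content records from one HTML page."""

    def __init__(self):
        super().__init__()
        self.hierarchy = [None] * 4  # current h1..h4 titles
        self.anchor = ""             # most recent section or heading id
        self.records = []
        self._tag = None             # element currently being captured
        self._parts = []             # text pieces of that element

    def handle_starttag(self, tag, attrs):
        if tag == "section" or tag in HEADING_LEVELS:
            self.anchor = dict(attrs).get("id") or self.anchor
        if self._tag is None and (tag in HEADING_LEVELS or tag in CONTENT_TAGS):
            self._tag, self._parts = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        self._tag = None
        # Collapse whitespace and drop the permalink pilcrow found in headings.
        text = " ".join("".join(self._parts).replace("\u00b6", "").split())
        if not text:
            return
        if tag in HEADING_LEVELS:
            level = HEADING_LEVELS[tag]
            # A new heading resets every deeper level.
            self.hierarchy[level:] = [text] + [None] * (3 - level)
            doc_type, content = f"lvl{level}", None
        else:
            doc_type, content = "content", text
        self.records.append({
            "type": doc_type,
            "hierarchy": list(self.hierarchy),
            "content": content,
            "anchor": self.anchor,
        })


sample = (
    '<section id="intro"><h1>Guide</h1><p>Welcome text.</p>'
    '<section id="setup"><h2>Setup</h2><ul><li>Install it.</li></ul>'
    "</section></section>"
)
extractor = HierarchyExtractor()
extractor.feed(sample)
extractor.close()
```

Each record carries a snapshot of the heading hierarchy in effect when the element was seen, which is what lets a paragraph be displayed under its page title and section in search results.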
Document Structure
Each indexed document has the following structure:
{
"id": "abc123...", # SHA256 hash
"hierarchy.lvl0": "Page Title", # h1
"hierarchy.lvl1": "Section", # h2
"hierarchy.lvl2": "Subsection", # h3
"hierarchy.lvl3": "Sub-subsection", # h4
"content": "Paragraph text...", # p, li
"url": "page.html#anchor",
"url_without_anchor": "page.html",
"anchor": "anchor",
"type": "content", # lvl0, lvl1, lvl2, lvl3, content
"version": "1.0",
"language": "en",
"weight": 50,
"item_priority": 50,
}
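A record with this shape could be assembled as in the sketch below. The make_document helper and its parameters are hypothetical, and the ID recipe (a SHA256 hash over URL, type, and text) is an assumption about what gets hashed; the page only states that IDs are SHA256 hashes.

```python
import hashlib

# Weight values as listed in this reference.
DOC_TYPE_WEIGHTS = {"lvl0": 100, "lvl1": 90, "lvl2": 80, "lvl3": 70, "content": 50}


def make_document(page, anchor, doc_type, hierarchy, content="",
                  version="1.0", language="en"):
    """Build one DocSearch-style record (hypothetical helper, not the library API)."""
    url = f"{page}#{anchor}" if anchor else page
    # Assumed ID recipe: hashing URL, type, and text gives the same ID on every
    # rebuild, so an upsert replaces the old record instead of duplicating it.
    text = content or " > ".join(title for title in hierarchy if title)
    doc_id = hashlib.sha256(f"{url}|{doc_type}|{text}".encode("utf-8")).hexdigest()
    doc = {
        "id": doc_id,
        "url": url,
        "url_without_anchor": page,
        "anchor": anchor,
        "type": doc_type,
        "version": version,
        "language": language,
        "weight": DOC_TYPE_WEIGHTS[doc_type],
        # Assumption: priority mirrors weight, as in the sample document above.
        "item_priority": DOC_TYPE_WEIGHTS[doc_type],
    }
    if content:
        doc["content"] = content
    for level, title in enumerate(hierarchy):
        if title:
            doc[f"hierarchy.lvl{level}"] = title
    return doc


doc = make_document(
    "page.html", "anchor", "content",
    ["Page Title", "Section", "Subsection", "Sub-subsection"],
    content="Paragraph text...",
)
```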