BISCUIT: Bitmap Indexed Searching with Combinatorial Union and Intersection Techniques

PostgreSQL Extension for High-Performance Wildcard Pattern Matching

Version 1.0.1

Executive Summary

BISCUIT is a PostgreSQL C extension that provides specialized bitmap indexing for wildcard pattern matching operations. The extension addresses a fundamental limitation in traditional database indexes: efficient handling of leading wildcard queries (e.g., LIKE '%pattern'), which typically require full table scans.

Key Capabilities

  • High-Performance Wildcard Matching: Achieves O(1) to O(log n) query performance regardless of wildcard placement
  • Automatic Index Maintenance: Transparent CRUD synchronization via PostgreSQL triggers
  • Memory-Efficient Storage: Compressed bitmap structures using Roaring Bitmap algorithm
  • Lazy Deletion Strategy: Deferred cleanup with intelligent batching for optimal throughput
  • Incremental Update Optimization: Minimized reindexing overhead for similar string modifications
  • Universal Primary Key Support: Compatible with any PostgreSQL data type
  • Production-Ready Design: Comprehensive error handling, statistics tracking, and diagnostic capabilities

Performance Characteristics

| Metric | Value |
|--------|-------|
| Query Latency (Simple Pattern) | O(1) to O(log n) |
| Query Latency (Complex Pattern) | O(k × m), k = parts, m = window |
| Insert Operation | O(L), L = string length |
| Update Operation (Incremental) | O(d), d = differences |
| Delete Operation | O(1) (lazy tombstone) |
| Memory Overhead | 10-30% of indexed data size |
| Supported String Length | 256 characters (indexed portion) |


Technical Overview

Problem Statement

Traditional database indexes (B-tree, GiST, GIN) exhibit suboptimal performance for wildcard pattern matching queries, particularly when wildcards appear at the beginning of patterns. A query such as SELECT * FROM table WHERE column LIKE '%pattern%' typically degrades to a sequential scan with O(n) complexity, where n represents the total number of rows.
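
The fallback is easy to observe with EXPLAIN. A hypothetical illustration (table, column, and cost figures are placeholders):

-- A leading wildcard defeats a B-tree index on the column
EXPLAIN SELECT * FROM customers WHERE email LIKE '%@example.com';

--                         QUERY PLAN
-- ------------------------------------------------------------
--  Seq Scan on customers  (cost=0.00..22150.00 rows=101 width=72)
--    Filter: (email ~~ '%@example.com'::text)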

Solution Architecture

BISCUIT implements a multi-dimensional bitmap index structure that decomposes strings into positional character mappings. Each character at each position is tracked using compressed bitmaps (Roaring Bitmaps), enabling constant-time intersection operations for pattern matching.

Core Index Structures

  1. Positional Character Index: Maps each character to a sorted array of position-bitmap pairs

    • Forward positions: {0, 1, 2, ..., n} from string start
    • Reverse positions: {-1, -2, -3, ..., -n} from string end
  2. Character Existence Cache: Union of all positional bitmaps per character for O(1) containment queries

  3. Length Index: Bitmap array indexed by string length for length-constrained queries

  4. Primary Key Hash Table: O(1) lookup structure mapping primary keys to internal indices

  5. Lazy Deletion Layer: Tombstone bitmap tracking deleted records with deferred cleanup

Algorithmic Approach

Pattern Parsing: Input patterns are decomposed into literal segments separated by wildcards (%). Each segment’s position constraints (prefix, suffix, or floating) are analyzed.
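
For example, the multi-part pattern 'user_%@%.com' (used again later in this document) decomposes into three segments; the annotations are illustrative:

'user_%@%.com'
  ├─ 'user_'  → prefix segment anchored at position 0 ('_' matches any single character)
  ├─ '@'      → floating segment, constrained to appear after the prefix
  └─ '.com'   → suffix segment anchored at the last four positions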

Query Optimization: Pattern structure determines execution strategy:

  • Single-character queries: Direct bitmap lookup
  • Prefix patterns (abc%): Position-0 intersection with length filter
  • Suffix patterns (%xyz): Negative-position intersection with length filter
  • Containment patterns (%abc%): Character cache intersection
  • Multi-part patterns: Windowed position matching with recursive refinement

Bitmap Operations: Compressed bitmaps enable efficient set operations:

  • Intersection: AND operation for constraint combination
  • Union: OR operation for alternative matches
  • Difference: ANDNOT operation for tombstone filtering
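
As a sketch in the notation used later in this document, a query for 'a%z' combines one bitmap per constraint:

result = pos_idx['a'][0]      -- 'a' at the first position   (AND)
       ∩ neg_idx['z'][-1]     -- 'z' at the last position    (AND)
       ANDNOT tombstones      -- drop lazily deleted records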

Data Flow

Query Pattern → Parse → Optimize → Bitmap Ops → Filter Tombstones → Result Set
                                                                          ↓
Insert/Update/Delete → Hash Lookup → Bitmap Update → Tombstone Management

Installation and Requirements

System Requirements

  • PostgreSQL Version: 11.0 or higher
  • Operating System: Linux, macOS, Windows (with MinGW)
  • Architecture: x86_64, ARM64
  • Memory: Minimum 512MB available RAM per index
  • Compiler: GCC 4.8+, Clang 3.9+, or MSVC 2017+

Optional Dependencies

  • CRoaring Library: For optimized Roaring Bitmap operations (recommended)
    • If unavailable, extension falls back to built-in bitmap implementation
    • Installation: apt-get install libroaring-dev (Debian/Ubuntu)

Installation Procedure

Method 1: PostgreSQL Extension System

CREATE EXTENSION biscuit;

This command must be executed by a superuser or a user with CREATE EXTENSION privileges.

Method 2: Manual Compilation

# Clone repository
git clone https://github.com/crystallinecore/biscuit.git
cd biscuit

# Compile
make

# Install (requires superuser privileges)
sudo make install

# Load extension
psql -d your_database -c "CREATE EXTENSION biscuit;"

Verification

SELECT biscuit_version();
-- Expected output: 1.0.1-Biscuit

Configuration and Setup

Initialization Options

Option 1: Automated Setup (Recommended)

The biscuit_setup() function provides a complete initialization workflow:

SELECT biscuit_setup(
    p_table_name TEXT,
    p_column_name TEXT,
    p_pk_column_name TEXT DEFAULT 'id'
);

Parameters:

  • p_table_name: Target table name (required)
  • p_column_name: Column to index (required, must be text-compatible)
  • p_pk_column_name: Primary key column name (default: ‘id’)

Operations Performed:

  1. Validates table and column existence
  2. Constructs bitmap index structure
  3. Generates strongly-typed wrapper functions
  4. Installs AFTER triggers for INSERT, UPDATE, DELETE operations
  5. Returns detailed confirmation message

Example:

SELECT biscuit_setup('customer_records', 'email_address', 'customer_id');

Output:

Biscuit index built successfully.
Created biscuit_match() and biscuit_match_rows() functions for table: customer_records
Columns: customer_id integer, email_address text
Successfully created trigger on table: customer_records
The index will now automatically update on INSERT, UPDATE, and DELETE operations.

Option 2: Manual Configuration (Advanced)

For fine-grained control, configure components individually:

Step 1: Index Construction

SELECT biscuit_build_index(
    table_name TEXT,
    column_name TEXT,
    pk_column_name TEXT DEFAULT 'id'
);

Step 2: Function Generation

SELECT biscuit_create_match_function();

This creates two query functions:

  • biscuit_match(pattern TEXT): Returns SETOF table_type
  • biscuit_match_rows(pattern TEXT): Returns TABLE(pk type, value TEXT)

Step 3: Trigger Activation

SELECT biscuit_enable_triggers();

Installs trigger named biscuit_auto_update on the indexed table.

Primary Key Considerations

BISCUIT supports heterogeneous primary key types through automatic type conversion:

| PostgreSQL Type | Internal Representation | Notes |
|-----------------|-------------------------|-------|
| INT2, INT4, INT8 | String (optimized) | Direct conversion |
| TEXT, VARCHAR, CHAR | String (native) | No conversion overhead |
| UUID | Canonical string format | Standardized representation |
| Other types | Type output function | Uses PostgreSQL type system |

Performance Note: Integer primary keys receive optimized conversion paths. UUID and text types have minimal overhead. Custom types utilize PostgreSQL’s type output functions.
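
For example, a UUID-keyed table needs no special configuration; the schema below is illustrative:

CREATE TABLE sessions (
    session_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),  -- built in since PostgreSQL 13; use pgcrypto on older versions
    user_agent TEXT
);

SELECT biscuit_setup('sessions', 'user_agent', 'session_id');
-- UUID keys are stored internally in canonical string form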

Index Scope and Limitations

  • Single Column Per Index: Each index covers one text column
  • Multiple Indexes Supported: Create separate indexes for different columns
  • Table Scope: One active BISCUIT index per table
  • Character Limit: Positions beyond 256 characters are not indexed (but stored)

Example: Multiple Column Indexing

-- Index email column
SELECT biscuit_setup('users', 'email', 'id');

-- To index another column, use a different approach
-- (current implementation: one index per table)

Query Interface

Function Signatures and Return Types

Primary Query Functions

1. Full Tuple Retrieval (Strongly-Typed)

biscuit_match(pattern TEXT) RETURNS SETOF table_type

Returns complete rows from the indexed table matching the specified pattern. Return type matches the exact schema of the indexed table.

Characteristics:

  • Type-safe: Returns actual table type
  • No column specification required
  • Supports standard SQL operations (WHERE, ORDER BY, JOIN)
  • Optimal for production queries

Example:

SELECT * FROM biscuit_match('%@example.com%');
SELECT id, email FROM biscuit_match('user%') WHERE created_at > '2024-01-01';
SELECT COUNT(*) FROM biscuit_match('%gmail%');

2. Generic Tuple Retrieval

biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk primary_key_type, value TEXT)

Returns primary key and indexed column value pairs.

Use Cases:

  • When only key-value pairs are needed
  • Intermediate processing before joining
  • Reduced data transfer overhead

Example:

SELECT pk, value FROM biscuit_match_rows('%pattern%');

-- Join back to original table if needed
SELECT t.*
FROM original_table t
JOIN biscuit_match_rows('%pattern%') b ON t.id = b.pk;

3. Count Aggregation

biscuit_match_count(pattern TEXT) RETURNS INTEGER

Returns the count of matching records without materializing result set.

Performance: This is the fastest query method, performing only bitmap operations without tuple retrieval.

Example:

SELECT biscuit_match_count('%search_term%');

4. Key-Value Extraction

biscuit_match_keys(pattern TEXT) RETURNS TABLE(pk TEXT, value TEXT)

Low-level function returning primary keys and indexed values as text. Used internally by higher-level functions.

Note: Primary keys are returned as text regardless of original type.

Query Patterns and SQL Integration

Standard SQL Operations

All biscuit_match() results support complete SQL operations:

-- Filtering
SELECT * FROM biscuit_match('%pattern%')
WHERE status = 'active' AND created_at > NOW() - INTERVAL '30 days';

-- Ordering
SELECT * FROM biscuit_match('A%')
ORDER BY name DESC, created_at ASC
LIMIT 100;

-- Aggregation
SELECT category, COUNT(*), AVG(price)
FROM biscuit_match('%widget%')
GROUP BY category
HAVING COUNT(*) > 10;

-- Joins
SELECT p.*, c.category_name
FROM biscuit_match('%special%') p
JOIN categories c ON p.category_id = c.id;

-- Subqueries
SELECT *
FROM products
WHERE id IN (
    SELECT id FROM biscuit_match('%premium%')
);

-- Common Table Expressions
WITH matched_products AS (
    SELECT * FROM biscuit_match('%electronics%')
)
SELECT m.*, s.stock_level
FROM matched_products m
JOIN stock s ON m.id = s.product_id;

Performance Considerations

Optimal Patterns:

  • Use bitmap index for pattern matching
  • Apply additional filters after pattern match
  • Leverage PostgreSQL’s query planner for joins

Query Plan Example:

EXPLAIN ANALYZE
SELECT * FROM biscuit_match('%search%')
WHERE status = 'active';

-- Result: Bitmap operations complete in milliseconds
-- Additional filter applied to result set

Pattern Matching Specification

Wildcard Operators

BISCUIT implements SQL LIKE-style pattern matching with two wildcard operators:

| Operator | Semantics | Character Match | Examples |
|----------|-----------|-----------------|----------|
| % | Zero or more characters | [0, ∞) | 'a%' → “a”, “ab”, “abcdef” |
| _ | Exactly one character | 1 | 'a_c' → “abc”, “axc” |

Pattern Categories and Optimization Paths

1. Exact Match Patterns

Pattern Structure: No wildcards
Example: 'exact_string'
Optimization: Length bitmap intersection with position-0 character match
Complexity: O(1)

SELECT biscuit_match_count('hello');
-- Matches only "hello" exactly

2. Prefix Patterns

Pattern Structure: Literal prefix followed by %
Example: 'prefix%'
Optimization: Position-0 character sequence with length ≥ prefix_length filter
Complexity: O(log n)

SELECT * FROM biscuit_match('user%');
-- Matches: "user", "user_admin", "user_john", etc.

3. Suffix Patterns

Pattern Structure: % followed by literal suffix
Example: '%suffix'
Optimization: Negative-position character sequence with length ≥ suffix_length filter
Complexity: O(log n)

SELECT * FROM biscuit_match('%@gmail.com');
-- Matches: "john@gmail.com", "admin@gmail.com"

4. Containment Patterns

Pattern Structure: %literal%
Example: '%search%'
Optimization: Character cache intersection
Complexity: O(1) to O(m), m = pattern_length

SELECT * FROM biscuit_match('%error%');
-- Matches any string containing "error"

5. Complex Multi-Part Patterns

Pattern Structure: Multiple literal segments with wildcards
Examples:

  • 'start%middle%end'
  • '%part1%part2%'
  • 'a%b%c%d'

Optimization: Windowed position matching with recursive refinement
Complexity: O(k × m), k = parts, m = search_window

SELECT * FROM biscuit_match('user_%@%.com');
-- Matches: "user_john@example.com", "user_admin@test.com"

6. Single-Character Wildcard Patterns

Pattern Structure: Underscores with optional literals
Examples:

  • '___' (exactly 3 characters)
  • 'a_c' (a, any char, c)
  • '___%' (at least 3 characters)

Optimization: Length bitmap or position bitmap intersection
Complexity: O(1) to O(log n)

SELECT * FROM biscuit_match('___');
-- Matches: "abc", "xyz", "123" (any 3-character string)

SELECT * FROM biscuit_match('test_');
-- Matches: "test1", "testA", "test_" (test + any char)

Pattern Syntax Reference

| Pattern | Description | Matches | Does Not Match |
|---------|-------------|---------|----------------|
| 'abc' | Exact string | “abc” | “abcd”, “xabc”, “ABC” |
| 'a%' | Starts with ‘a’ | “a”, “abc”, “a123” | “ba”, “ABC” |
| '%z' | Ends with ‘z’ | “z”, “xyz”, “123z” | “za”, “Z” |
| '%abc%' | Contains ‘abc’ | “abc”, “xabcx”, “123abc” | “ab”, “ABC” |
| 'a%z' | Starts ‘a’, ends ‘z’ | “az”, “abcz”, “a123xyz” | “za”, “abc” |
| '_' | Single character | “a”, “1” | “”, “ab” |
| '___' | Exactly 3 chars | “abc”, “123” | “ab”, “abcd” |
| 'a_c' | ‘a’ + any + ‘c’ | “abc”, “axc”, “a1c” | “ac”, “abbc” |
| '%a%b%' | ‘a’ before ‘b’ | “ab”, “axb”, “xaby” | “ba”, “bxa” |
| '%' | Any string | All non-NULL | NULL |
| '' | Empty string | “” | Any non-empty |

Special Cases

Empty Pattern

SELECT biscuit_match_count('');
-- Returns count of empty strings in indexed column

Match All

SELECT biscuit_match_count('%');
-- Returns count of all non-deleted records

Case Sensitivity

BISCUIT patterns are case-sensitive by default. For case-insensitive matching:

-- Option 1: Index lowercase values
CREATE TABLE users_normalized (
    id SERIAL PRIMARY KEY,
    email TEXT,
    email_lower TEXT GENERATED ALWAYS AS (LOWER(email)) STORED
);

SELECT biscuit_setup('users_normalized', 'email_lower', 'id');
SELECT * FROM biscuit_match(LOWER('%SEARCH%'));

-- Option 2: Use expression index (future enhancement)

NULL Handling

  • NULL values are never indexed
  • Rows whose indexed column is NULL are skipped during index construction and by the triggers
  • Pattern matching therefore never returns rows with a NULL indexed value

-- NULL values are skipped
INSERT INTO table (id, text_col) VALUES (1, NULL);
SELECT biscuit_match_count('%');
-- Does not include row 1

CRUD Operations and Index Maintenance

Automatic Synchronization

When triggers are enabled via biscuit_setup() or biscuit_enable_triggers(), all DML operations automatically maintain index consistency with zero application code changes.

INSERT Operations

Behavior: New records are immediately indexed and available for queries.

Algorithm:

  1. Extract primary key and indexed column value from new tuple
  2. Convert primary key to canonical string representation
  3. Allocate index slot (reuse free slot if available, otherwise expand)
  4. Generate positional bitmaps for each character
  5. Update length index and character cache
  6. Register primary key in hash table

Complexity: O(L), where L = string length (up to 256 characters)

Example:

INSERT INTO products (id, name, description)
VALUES (10001, 'Premium Widget', 'High-quality widget');

-- Index automatically updated
SELECT * FROM biscuit_match('%Widget%');
-- Returns the newly inserted row immediately

Bulk Insert Performance:

-- For large bulk inserts, consider disabling triggers
SELECT biscuit_disable_triggers();

INSERT INTO products (id, name)
SELECT generate_series(1, 1000000), 'Product ' || generate_series(1, 1000000);

-- Rebuild index after bulk operation
SELECT biscuit_build_index('products', 'name', 'id');
SELECT biscuit_enable_triggers();

UPDATE Operations

Behavior: Updates are intelligently optimized based on string similarity.

Incremental Update Path

Criteria for Incremental Update:

  • Same string length (before and after)
  • Minimum length ≥ 3 characters
  • Character differences < 20% of string length
  • Maximum 3 changed characters

Algorithm:

  1. Compute character-level diff between old and new values
  2. Remove old characters from affected position bitmaps
  3. Add new characters to affected position bitmaps
  4. Update character cache for changed characters only

Complexity: O(d), where d = number of different characters

Example (Incremental):

-- Original: "john.doe@example.com"
UPDATE users SET email = 'john.doe@example.org' WHERE id = 123;
-- Only 3 characters changed (com→org), uses incremental update

Full Reindex Path

Criteria:

  • Different string lengths
  • Changes exceed incremental update threshold
  • String < 3 characters

Algorithm:

  1. Remove all position bitmaps for old value
  2. Remove from length index and character cache
  3. Re-index completely with new value
  4. Update length index and character cache

Complexity: O(L), where L = new string length

Example (Full Reindex):

-- Original: "short"
UPDATE users SET email = 'much.longer.email@example.com' WHERE id = 123;
-- Different lengths, requires full reindex

Statistics Tracking:

SELECT biscuit_index_status();
-- Shows ratio of incremental vs full updates
-- Example: "Updates: 1000 (incr: 750, 75.0%)"
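
A minimal sketch exercising both paths (table, values, and ids are illustrative):

-- 20 chars → 20 chars, 3 characters differ: incremental path
UPDATE users SET email = 'jane.doe@example.com' WHERE id = 123;  -- was 'john.doe@example.com'

-- 20 chars → 7 chars: full reindex path
UPDATE users SET email = 'j@ex.co' WHERE id = 123;

SELECT biscuit_index_status();
-- The incremental percentage reflects the mix of both paths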

DELETE Operations

Behavior: Implements lazy deletion with tombstone bitmaps for optimal performance.

Algorithm (O(1) Lazy Deletion):

  1. Hash table lookup to find internal index (O(1))
  2. Add index to tombstone bitmap (O(1))
  3. Store index in deletion queue
  4. Remove primary key from hash table
  5. Add slot to free list for future reuse
  6. Defer removal from position/length bitmaps

Complexity: O(1) - actual cleanup deferred

Example:

DELETE FROM products WHERE id = 123;
-- Returns immediately, cleanup happens later

Tombstone Cleanup Process

Automatic Cleanup Triggers:

  • Tombstone count ≥ threshold (default: 1000)
  • Manual invocation: SELECT biscuit_cleanup();

Cleanup Algorithm:

  1. Batch remove tombstoned indices from all bitmaps
  2. Free associated string memory
  3. Clear tombstone bitmap
  4. Reset deletion queue

Cleanup Statistics:

SELECT biscuit_index_status();
-- Pending tombstones: 856 (85.6% of deletes)
-- Total cleanups: 12
-- Items cleaned: 11442

Performance Characteristics:

| Scenario | Delete Latency | Query Overhead | Memory Overhead |
|----------|----------------|----------------|-----------------|
| No tombstones | O(1) | None | Minimal |
| < 1000 tombstones | O(1) | O(t) filter | ~t × 32 bytes |
| ≥ 1000 tombstones | O(1) + cleanup | Minimal | Minimal |

Note: Query operations automatically filter tombstones, so deleted records never appear in results despite deferred cleanup.

Upsert Operations (INSERT … ON CONFLICT)

Behavior: Automatically handled through INSERT and UPDATE trigger paths.

Example:

INSERT INTO products (id, name)
VALUES (123, 'New Product')
ON CONFLICT (id)
DO UPDATE SET name = EXCLUDED.name;

-- If row 123 exists: triggers UPDATE path
-- If row 123 doesn't exist: triggers INSERT path

Transaction Semantics

ACID Compliance:

  • Atomicity: Index updates commit/rollback with transaction
  • Consistency: Index always reflects committed table state
  • Isolation: Follows PostgreSQL’s MVCC isolation levels
  • Durability: Underlying table data is durable; the in-memory index must be rebuilt after a server restart

Example:

BEGIN;
    INSERT INTO products (id, name) VALUES (1, 'Product A');
    SELECT * FROM biscuit_match('%Product A%');  -- Visible in transaction
ROLLBACK;

SELECT * FROM biscuit_match('%Product A%');  -- Not visible, index reverted

Current Limitation: Index resides in shared memory and must be rebuilt after server restart.

Trigger Management

Enable Triggers

SELECT biscuit_enable_triggers();

Creates trigger: biscuit_auto_update as AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW.

Disable Triggers

SELECT biscuit_disable_triggers();

Removes triggers but preserves index structure. Use for:

  • Bulk data operations
  • Table truncation
  • Data migration
  • Testing scenarios

Check Trigger Status

SELECT biscuit_index_status();
-- Shows whether triggers are active

Architecture and Implementation

System Components

┌─────────────────────────────────────────────────────────────┐
│                     PostgreSQL Backend                       │
├─────────────────────────────────────────────────────────────┤
│  Query Interface  │  Trigger System  │  Memory Management   │
└─────────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                    BISCUIT Extension Layer                   │
├──────────────────────┬──────────────────┬───────────────────┤
│  Pattern Parser      │  Query Executor  │  CRUD Handler     │
│  - Wildcard analysis │  - Bitmap ops    │  - Insert/Update  │
│  - Optimization      │  - Result filter │  - Delete/Cleanup │
└──────────────────────┴──────────────────┴───────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│                     Index Data Structures                    │
├──────────────┬──────────────┬──────────────┬────────────────┤
│  Positional  │  Character   │   Length     │   PK Hash      │
│  Bitmaps     │  Cache       │   Index      │   Table        │
│  (256 pos)   │  (256 chars) │  (L bitmaps) │   (O(1))       │
└──────────────┴──────────────┴──────────────┴────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────────┐
│               Roaring Bitmap Implementation                  │
├──────────────────────────────────────────────────────────────┤
│  - Compressed bitmap storage                                 │
│  - Fast set operations (AND, OR, ANDNOT)                    │
│  - Memory-efficient sparse representation                    │
└──────────────────────────────────────────────────────────────┘

Data Structure Details

1. Positional Character Index

Purpose: Track character occurrences at specific positions

Structure:

typedef struct {
    int pos;                  // Position in string
    RoaringBitmap *bitmap;    // Record IDs containing char at pos
} PosEntry;

typedef struct {
    PosEntry *entries;        // Sorted array of position entries
    int count;                // Number of positions tracked
    int capacity;             // Allocated capacity
} CharIndex;

CharIndex pos_idx[256];       // Forward positions (0, 1, 2, ...)
CharIndex neg_idx[256];       // Reverse positions (-1, -2, -3, ...)

Example: For strings [“cat”, “car”, “dog”]:

pos_idx['c'][0] = {0, 1}     // 'c' at position 0 in records 0, 1
pos_idx['a'][1] = {0, 1}     // 'a' at position 1 in records 0, 1
pos_idx['t'][2] = {0}        // 't' at position 2 in record 0
neg_idx['t'][-1] = {0}       // 't' at last position in record 0

Access Complexity: Binary search over positions: O(log P), P ≤ 256

2. Character Existence Cache

Purpose: O(1) lookup for “contains character” queries

Structure:

RoaringBitmap *char_cache[256];  // Union of all position bitmaps per char

Construction:

char_cache['a'] = pos_idx['a'][0] ∪ pos_idx['a'][1] ∪ ... ∪ pos_idx['a'][n]

Use Case: Pattern %a% directly queries char_cache['a']

3. Length Index

Purpose: Filter records by string length

Structure:

typedef struct {
    RoaringBitmap **length_bitmaps;  // Array indexed by length
    int max_length;                  // Maximum observed length
} LengthIndex;

Example:

length_bitmaps[3] = {0, 1, 2}    // Records 0, 1, 2 have length 3
length_bitmaps[5] = {3, 4}       // Records 3, 4 have length 5

Use Case: Pattern ___ queries length_bitmaps[3] directly

4. Primary Key Hash Table

Purpose: O(1) mapping from primary key to internal index

Implementation:

typedef struct {
    char pk_str[NAMEDATALEN];  // Fixed-size key (max 64 bytes)
    uint32_t idx;              // Internal record index
} PKMapEntry;

HTAB *pk_to_index;  // PostgreSQL hash table

Operations:

  • Insert: hash_search(HASH_ENTER) - O(1)
  • Lookup: hash_search(HASH_FIND) - O(1)
  • Delete: hash_search(HASH_REMOVE) - O(1)

Memory: ~96 bytes per entry (key + value + overhead)

5. Lazy Deletion System

Purpose: Defer expensive cleanup operations

Structure:

typedef struct {
    RoaringBitmap *tombstones;      // Bitmap of deleted indices
    uint32_t *deleted_indices;       // Queue for cleanup
    int tombstone_count;             // Current tombstone count
    int tombstone_threshold;         // Cleanup trigger threshold
    int64 total_cleanups;            // Lifetime cleanup count
    int64 items_cleaned;             // Total items cleaned
} LazyDeletion;

Workflow:

  1. Delete: Mark bit in tombstone bitmap (O(1))
  2. Query: Filter results using ANDNOT tombstones (O(log n))
  3. Cleanup: Batch remove when threshold reached (O(n × 256))

Adaptive Threshold: Adjusts based on query/delete ratio

Memory Management

Allocation Strategy

Memory Context: Dedicated AllocSetContext for index structures

  • Lifetime: Until explicit index rebuild or server restart
  • Isolation: Independent of query contexts
  • Cleanup: Automatic on context deletion

Size Estimation:

Total Memory = Base + Positional + Cache + Length + Hash + Strings

Base       = ~1 MB (structs and metadata)
Positional = ~256 chars × P positions × 64 bytes/bitmap ≈ 16P KB
Cache      = ~256 chars × 64 bytes/bitmap ≈ 16 KB
Length     = L × 64 bytes/bitmap (L = max length)
Hash       = N × 96 bytes (N = record count)
Strings    = N × avg_length × 2 (data + PK)

Example (1M records, avg 50 chars, max 100 positions):
= 1 MB + 1.6 MB + 16 KB + 6.4 KB + 96 MB + 100 MB ≈ 200 MB

Compression Benefits: Roaring Bitmaps achieve 10-50× compression for sparse data

Garbage Collection

Automatic:

  • Tombstone cleanup when threshold reached
  • Free list management for slot reuse

Manual:

SELECT biscuit_cleanup();  -- Force tombstone cleanup

Memory Leak Prevention:

  • All allocations tracked in index context
  • Single deallocation point on rebuild
  • No external references retained

Concurrency Model

Index Access: Read-optimized, single-writer model

  • Queries: Concurrent read access (no locking)
  • Modifications: Serialized through PostgreSQL trigger system
  • Consistency: PostgreSQL’s MVCC handles transaction isolation

Limitation: High-concurrency write workloads may experience contention at trigger level (PostgreSQL limitation, not BISCUIT-specific)

Scalability: Read queries scale linearly across cores


Advanced Features

Incremental Updates

For same-length strings with <20% character changes, BISCUIT uses incremental updates:

-- Original: "hello world"
-- Updated:  "hello wurld"  
-- Only updates changed positions (saves ~80% of work)

Lazy Deletion with Smart Cleanup

Deletes are instant (O(1)) using tombstone bitmaps. Cleanup triggers automatically when:

  • Tombstone count reaches threshold (default: 1000)
  • Query/Delete ratio suggests overhead
  • Manual trigger: SELECT biscuit_cleanup();

Pattern Optimization

BISCUIT automatically detects and optimizes:

  • Wildcard-only patterns: %%%, ___
  • Single-character positioned: %a%, a%, %a
  • Prefix patterns: abc%
  • Suffix patterns: %xyz
  • Exact length: ___ (3 chars)
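
These fast paths can be exercised directly with biscuit_match_count(); which path fires depends only on the pattern's shape:

SELECT biscuit_match_count('%');      -- match-all: active-record count
SELECT biscuit_match_count('___');    -- exact length: one length-bitmap lookup
SELECT biscuit_match_count('abc%');   -- prefix: position-0 intersection
SELECT biscuit_match_count('%xyz');   -- suffix: negative-position intersection
SELECT biscuit_match_count('%a%');    -- containment: character-cache lookup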

API Reference

Index Management Functions

biscuit_setup()

Complete initialization function combining index build, function creation, and trigger setup.

biscuit_setup(
    p_table_name TEXT,
    p_column_name TEXT,
    p_pk_column_name TEXT DEFAULT 'id'
) RETURNS TEXT

Parameters:

  • p_table_name: Name of table to index (required)
  • p_column_name: Column containing text data to index (required)
  • p_pk_column_name: Primary key column name (default: ‘id’)

Returns: Status message with setup details

Example:

SELECT biscuit_setup('customers', 'email', 'customer_id');

Output:

Biscuit index built successfully.
Created biscuit_match() and biscuit_match_rows() functions for table: customers
Columns: customer_id integer, email text
Successfully created trigger on table: customers
The index will now automatically update on INSERT, UPDATE, and DELETE operations.

Errors:

  • Primary key column not found
  • Indexed column not found
  • Insufficient permissions
  • Table does not exist

biscuit_build_index()

Constructs bitmap index structure from table data.

biscuit_build_index(
    table_name TEXT,
    column_name TEXT,
    pk_column_name TEXT DEFAULT 'id'
) RETURNS BOOLEAN

Parameters:

  • table_name: Target table name
  • column_name: Column to index
  • pk_column_name: Primary key column (default: ‘id’)

Returns: TRUE on success

Behavior:

  • Drops existing index if present
  • Scans entire table sequentially
  • Builds all bitmap structures
  • Allocates shared memory context

Performance: O(n × L), n=rows, L=avg string length

Example:

SELECT biscuit_build_index('products', 'name', 'id');

biscuit_create_match_function()

Generates strongly-typed wrapper functions for queries.

biscuit_create_match_function() RETURNS TEXT

Returns: Confirmation message with function signatures

Generated Functions:

  1. biscuit_match(pattern TEXT) RETURNS SETOF table_type
  2. biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk type, value TEXT)

Prerequisite: Index must be built first

Example:

SELECT biscuit_create_match_function();

Output:

Created biscuit_match() and biscuit_match_rows() functions for table: products
Columns: id integer, name text, category text, price numeric

biscuit_enable_triggers()

Activates automatic index maintenance triggers.

biscuit_enable_triggers() RETURNS TEXT

Returns: Status message

Behavior:

  • Creates AFTER INSERT OR UPDATE OR DELETE trigger
  • Trigger name: biscuit_auto_update
  • Fires FOR EACH ROW
  • Calls biscuit_trigger() function

Example:

SELECT biscuit_enable_triggers();

biscuit_disable_triggers()

Deactivates automatic index maintenance.

biscuit_disable_triggers() RETURNS TEXT

Returns: Status message

Use Cases:

  • Bulk data operations
  • Table maintenance
  • Performance testing
  • Data migration

Example:

SELECT biscuit_disable_triggers();
-- Perform bulk operations
SELECT biscuit_enable_triggers();

Query Functions

biscuit_match()

Primary query interface returning complete table rows.

biscuit_match(pattern TEXT) RETURNS SETOF table_type

Parameters:

  • pattern: Wildcard pattern (SQL LIKE syntax)

Returns: Complete rows matching pattern

Characteristics:

  • Strongly-typed return
  • Full SQL compatibility
  • Automatic tombstone filtering
  • Optimal for production use

Example:

SELECT * FROM biscuit_match('%@example.com');
SELECT id, name FROM biscuit_match('Product%') WHERE price > 100;

biscuit_match_rows()

Alternative query interface returning key-value pairs.

biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk primary_key_type, value TEXT)

Parameters:

  • pattern: Wildcard pattern

Returns: Table of (primary_key, indexed_value) pairs

Example:

SELECT * FROM biscuit_match_rows('%error%');

biscuit_match_count()

Efficient count-only query.

biscuit_match_count(pattern TEXT) RETURNS INTEGER

Parameters:

  • pattern: Wildcard pattern

Returns: Count of matching records

Performance: Fastest query method (no tuple materialization)

Example:

SELECT biscuit_match_count('%gmail.com%');

biscuit_match_keys()

Low-level interface returning primary keys and values as text.

biscuit_match_keys(pattern TEXT) RETURNS TABLE(pk TEXT, value TEXT)

Parameters:

  • pattern: Wildcard pattern

Returns: Table of (pk_text, value_text) pairs

Note: Used internally by higher-level functions

Example:

SELECT pk::INTEGER, value FROM biscuit_match_keys('%pattern%');

Status and Diagnostic Functions

biscuit_index_status()

Comprehensive index status report.

biscuit_index_status() RETURNS TEXT

Returns: Formatted status report including:

  • Table and column information
  • Record counts (active, total, free, tombstoned)
  • Memory usage
  • CRUD statistics
  • Lazy deletion metrics
  • Optimization indicators

Example:

SELECT biscuit_index_status();

Sample Output:

========================================
Biscuit Index v3 - FIXED
========================================
Table: products
Column: name
Primary Key: id
Active Records: 998567
Total Slots: 1000000
Free Slots: 1433
Tombstoned Slots: 856
Max length: 128
Memory: 18.3 MB
----------------------------------------
CRUD Statistics:
  Inserts: 1000000
  Deletes: 1433
  Updates: 125678 (incr: 89234, 71.0%)
  Queries: 45892
----------------------------------------
Lazy Deletion Status:
  Pending tombstones: 856 (59.7% of deletes)
  Cleanup threshold: 1000
  Total cleanups: 1
  Items cleaned: 577
  Query/Delete ratio: 32.02
----------------------------------------

biscuit_get_active_count()

Returns active record count.

biscuit_get_active_count() RETURNS INTEGER

Returns: Count of non-deleted records

Example:

SELECT biscuit_get_active_count();
-- Output: 998567

biscuit_get_free_slots()

Returns available slot count for reuse.

biscuit_get_free_slots() RETURNS INTEGER

Returns: Number of slots available for new inserts

Example:

SELECT biscuit_get_free_slots();
-- Output: 1433

biscuit_get_tombstone_count()

Returns pending tombstone count.

biscuit_get_tombstone_count() RETURNS INTEGER

Returns: Number of deleted records awaiting cleanup

Use Case: Monitor cleanup necessity

Example:

SELECT biscuit_get_tombstone_count();
-- Output: 856

biscuit_version()

Returns extension version.

biscuit_version() RETURNS TEXT

Returns: Version string

Example:

SELECT biscuit_version();
-- Output: 1.0.1-Biscuit

Maintenance Functions

biscuit_cleanup()

Manual tombstone cleanup.

biscuit_cleanup() RETURNS TEXT

Returns: Cleanup summary

Behavior:

  • Removes tombstoned indices from all bitmaps
  • Frees associated memory
  • Resets tombstone counter
  • Updates cleanup statistics

Performance: O(n × 256), n=tombstone_count

Example:

SELECT biscuit_cleanup();

Output:

Tombstone cleanup complete:
  Cleaned: 856 tombstones
  Remaining: 0
  Total cleanups: 2

When to Use:

  • Before performance-critical operations
  • After bulk deletions
  • When tombstone count exceeds threshold
  • During maintenance windows

Internal Functions (Advanced)

biscuit_trigger()

Internal trigger function (do not call directly).

biscuit_trigger() RETURNS TRIGGER

Purpose: Maintains index consistency on DML operations

Called By: PostgreSQL trigger system

Handles:

  • INSERT: Adds new records to index
  • UPDATE: Applies incremental or full updates
  • DELETE: Marks tombstones

biscuit_match_tuples()

Generic tuple retrieval (legacy).

biscuit_match_tuples(pattern TEXT) RETURNS SETOF RECORD

Note: Use biscuit_match() instead for type-safe queries


Function Error Handling

All functions implement comprehensive error handling:

Common Errors:

  • ERROR: Index not built - Call biscuit_build_index() first
  • ERROR: NULL primary key - Primary key cannot be NULL
  • ERROR: Column not found - Verify table/column names
  • ERROR: Insufficient permissions - Requires table owner or superuser

Example Error:

SELECT * FROM biscuit_match('%pattern%');
-- ERROR: Index not built. Call biscuit_build_index() first.

Return Type Specifications

| Function | Return Type | Nullable | Notes |
|----------|-------------|----------|-------|
| biscuit_match() | SETOF table | No | Type matches indexed table |
| biscuit_match_rows() | TABLE(pk, TEXT) | No | PK type preserved |
| biscuit_match_count() | INTEGER | No | Always returns count ≥ 0 |
| biscuit_match_keys() | TABLE(TEXT, TEXT) | No | Both fields as text |
| biscuit_index_status() | TEXT | No | Always returns report |
| biscuit_get_*_count() | INTEGER | No | Returns 0 if no index |
| biscuit_version() | TEXT | No | Version string |


Monitoring and Diagnostics

Key Performance Indicators

1. Index Health Metrics

Active Record Count

SELECT biscuit_get_active_count();

Interpretation:

  • Should match table row count (excluding deleted)
  • Discrepancy indicates synchronization issues
  • Monitor trend for capacity planning

Free Slot Count

SELECT biscuit_get_free_slots();

Interpretation:

  • Available slots for insert reuse
  • High count after bulk deletes is normal
  • Prevents capacity expansion overhead

Tombstone Count

SELECT biscuit_get_tombstone_count();

Interpretation:

  • Pending deleted records
  • Critical Threshold: >1000 (default cleanup trigger)
  • Warning Threshold: >5000 (performance degradation)
  • Action Required: >10000 (manual cleanup recommended)

2. Operational Statistics

Comprehensive Status

SELECT biscuit_index_status();

Key Sections:

A. Capacity Metrics

Active Records: 998567     ← Currently queryable records
Total Slots: 1000000       ← Allocated capacity
Free Slots: 1433          ← Available for reuse
Tombstoned Slots: 856     ← Pending cleanup

B. CRUD Performance

Inserts: 1000000
Deletes: 1433
Updates: 125678 (incr: 89234, 71.0%)  ← Incremental update rate
Queries: 45892

Incremental Update Rate:

  • >70%: Excellent (similar updates)
  • 50-70%: Good (moderate similarity)
  • <50%: Normal (diverse updates)
  • <20%: Review update patterns
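
The rate can be pulled out programmatically with the same regexp_matches() idiom used elsewhere in this section (this assumes the status-line format shown above remains stable):

SELECT (regexp_matches(
    biscuit_index_status(),
    'incr: [0-9]+, ([0-9.]+)%'
))[1]::NUMERIC AS incremental_update_pct;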

C. Lazy Deletion Health

Pending tombstones: 856 (59.7% of deletes)
Cleanup threshold: 1000
Total cleanups: 1
Items cleaned: 577
Query/Delete ratio: 32.02

Query/Delete Ratio:

  • >20: Healthy (reads dominate, lazy deletion beneficial)
  • 10-20: Balanced (cleanup overhead acceptable)
  • <10: Write-heavy (consider more aggressive cleanup)

3. Memory Utilization

SELECT 
    (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+) MB'))[1]::NUMERIC AS memory_mb;

Guidelines:

Memory Growth = f(record_count, avg_length, char_diversity)

Expected Ranges:
  100K records:   2-5 MB
  1M records:     15-25 MB
  10M records:    150-250 MB
  100M records:   1.5-2.5 GB

Anomalies:
  >30% above expected: Check for memory leaks or fragmentation
  <50% of expected: Excellent compression (sparse data)
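
A rough self-check against the ~20 MB per million records heuristic above (the constant is an assumption; tune it to your data):

WITH m AS (
    SELECT
        biscuit_get_active_count() AS records,
        (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+) MB'))[1]::NUMERIC AS actual_mb
)
SELECT
    records,
    actual_mb,
    ROUND(records / 1000000.0 * 20, 2) AS expected_mb,
    ROUND(100.0 * actual_mb / NULLIF(records / 1000000.0 * 20, 0), 1) AS pct_of_expected
FROM m;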

Monitoring Queries

Real-Time Health Check

-- Comprehensive health dashboard
WITH metrics AS (
    SELECT 
        biscuit_get_active_count() AS active,
        biscuit_get_free_slots() AS free,
        biscuit_get_tombstone_count() AS tombstones,
        (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER AS capacity,
        (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS memory_mb
)
SELECT 
    active,
    free,
    tombstones,
    capacity,
    memory_mb,
    ROUND(100.0 * active / capacity, 2) AS utilization_pct,
    ROUND(100.0 * tombstones / NULLIF(capacity - active, 0), 2) AS tombstone_ratio,
    CASE 
        WHEN tombstones > 10000 THEN 'CRITICAL - Cleanup Required'
        WHEN tombstones > 5000 THEN 'WARNING - Monitor Closely'
        WHEN tombstones > 1000 THEN 'NOTICE - Cleanup Soon'
        ELSE 'HEALTHY'
    END AS health_status
FROM metrics;

Sample Output:

 active  | free | tombstones | capacity | memory_mb | utilization_pct | tombstone_ratio | health_status
---------+------+------------+----------+-----------+-----------------+-----------------+---------------
 998567  | 1433 |        856 |  1000000 |     18.30 |           99.86 |           59.70 | HEALTHY

Performance Degradation Detection

-- Detect query performance issues
WITH baseline AS (
    SELECT 2.1 AS expected_ms  -- Baseline query time
),
current AS (
    SELECT 
        biscuit_get_tombstone_count() AS tombstones,
        -- Simulate query time (in production, measure actual queries)
        2.1 + (biscuit_get_tombstone_count()::NUMERIC / 1000) * 0.2 AS estimated_ms
)
SELECT 
    tombstones,
    estimated_ms,
    ROUND(((estimated_ms - expected_ms) / expected_ms) * 100, 2) AS degradation_pct,
    CASE 
        WHEN estimated_ms > expected_ms * 2 THEN 'ACTION REQUIRED'
        WHEN estimated_ms > expected_ms * 1.5 THEN 'DEGRADED'
        WHEN estimated_ms > expected_ms * 1.2 THEN 'ELEVATED'
        ELSE 'NORMAL'
    END AS performance_status
FROM current, baseline;

Capacity Planning

-- Project capacity needs
WITH stats AS (
    SELECT 
        biscuit_get_active_count() AS current_active,
        (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER AS capacity,
        (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT AS total_inserts,
        (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT AS total_deletes
),
growth AS (
    SELECT 
        current_active,
        capacity,
        total_inserts - total_deletes AS net_growth,
        CASE 
            WHEN total_deletes > 0 THEN total_inserts::NUMERIC / total_deletes
            ELSE NULL
        END AS insert_delete_ratio
    FROM stats
)
SELECT 
    current_active,
    capacity,
    capacity - current_active AS available,
    ROUND(100.0 * current_active / capacity, 2) AS utilization_pct,
    net_growth,
    insert_delete_ratio,
    CASE 
        WHEN current_active::NUMERIC / capacity > 0.9 THEN 'EXPAND SOON'
        WHEN current_active::NUMERIC / capacity > 0.8 THEN 'MONITOR'
        ELSE 'ADEQUATE'
    END AS capacity_status
FROM growth;

Alerting Thresholds

Critical Alerts

-- Immediate action required
DO $$
DECLARE
    tombstone_count INTEGER;
    active_count INTEGER;
    capacity INTEGER;
BEGIN
    tombstone_count := biscuit_get_tombstone_count();
    active_count := biscuit_get_active_count();
    capacity := (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER;

    IF tombstone_count > 10000 THEN
        RAISE WARNING 'CRITICAL: Tombstone count % exceeds 10,000. Run biscuit_cleanup() immediately.', tombstone_count;
    END IF;

    IF active_count::NUMERIC / capacity > 0.95 THEN
        RAISE WARNING 'CRITICAL: Index utilization at % percent exceeds the 95 percent threshold. Capacity expansion imminent.',
                      round(100.0 * active_count / capacity, 2);
    END IF;
END $$;

Warning Alerts

-- Monitor closely, plan maintenance
DO $$
DECLARE
    tombstone_count INTEGER;
    memory_mb NUMERIC;
BEGIN
    tombstone_count := biscuit_get_tombstone_count();
    memory_mb := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC;

    IF tombstone_count BETWEEN 5000 AND 10000 THEN
        RAISE NOTICE 'WARNING: Tombstone count % in elevated range. Schedule cleanup.', tombstone_count;
    END IF;

    -- Assuming 1M records should use ~20MB
    IF memory_mb > 30 THEN
        RAISE NOTICE 'WARNING: Memory usage % MB is higher than expected. Check for leaks.', round(memory_mb, 2);
    END IF;
END $$;

Diagnostic Procedures

Issue: Slow Query Performance

Diagnosis:

-- Check tombstone overhead
SELECT biscuit_get_tombstone_count();

-- Check memory usage
SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC;

-- Review pattern complexity
EXPLAIN ANALYZE SELECT * FROM biscuit_match('%complex%pattern%');

Resolution:

-- If tombstones > 5000
SELECT biscuit_cleanup();

-- If memory excessive
SELECT biscuit_build_index('table', 'column', 'pk');  -- Rebuild

Issue: Index Out of Sync

Diagnosis:

-- Compare counts
SELECT 
    COUNT(*) AS table_count,
    biscuit_get_active_count() AS index_count
FROM your_table;

-- Check trigger status
SELECT tgname, tgenabled 
FROM pg_trigger 
WHERE tgname = 'biscuit_auto_update';

Resolution:

-- Rebuild index
SELECT biscuit_build_index('table', 'column', 'pk');
SELECT biscuit_enable_triggers();

Issue: High Memory Usage

Diagnosis:

-- Detailed memory breakdown
SELECT biscuit_index_status();

-- Check for long strings
SELECT MAX(LENGTH(column)), AVG(LENGTH(column)) FROM table;

Resolution:

-- Rebuild to compact
SELECT biscuit_build_index('table', 'column', 'pk');

-- Consider limiting indexed length (future enhancement)

Logging and Audit Trail

Enable PostgreSQL Logging

-- Log slow queries
SET log_min_duration_statement = 100;  -- Log queries >100ms

-- Log BISCUIT operations
SET client_min_messages = NOTICE;

-- View statement statistics (requires the pg_stat_statements extension)
SELECT * FROM pg_stat_statements
WHERE query LIKE '%biscuit%'
ORDER BY mean_exec_time DESC;

Custom Monitoring Table

-- Create monitoring log
CREATE TABLE biscuit_metrics (
    logged_at TIMESTAMPTZ DEFAULT NOW(),
    active_count INTEGER,
    free_slots INTEGER,
    tombstone_count INTEGER,
    memory_mb NUMERIC,
    query_count BIGINT,
    insert_count BIGINT,
    update_count BIGINT,
    delete_count BIGINT
);

-- Periodic logging (via cron or pg_cron)
INSERT INTO biscuit_metrics
SELECT 
    NOW(),
    biscuit_get_active_count(),
    biscuit_get_free_slots(),
    biscuit_get_tombstone_count(),
    (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC,
    (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT,
    (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT,
    (regexp_matches(biscuit_index_status(), 'Updates: ([0-9]+)'))[1]::BIGINT,
    (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT;

-- Analyze trends
SELECT 
    DATE_TRUNC('hour', logged_at) AS hour,
    AVG(tombstone_count) AS avg_tombstones,
    MAX(tombstone_count) AS max_tombstones,
    AVG(memory_mb) AS avg_memory
FROM biscuit_metrics
WHERE logged_at > NOW() - INTERVAL '7 days'
GROUP BY 1
ORDER BY 1 DESC;

Optimization Guidelines

Query Optimization

1. Pattern Design Best Practices

Optimal Patterns:

-- ✓ GOOD: Specific prefix
SELECT * FROM biscuit_match('user_admin%');

-- ✓ GOOD: Specific suffix  
SELECT * FROM biscuit_match('%@company.com');

-- ✓ GOOD: Single containment
SELECT * FROM biscuit_match('%error%');

-- ⚠ ACCEPTABLE: Two-part pattern
SELECT * FROM biscuit_match('%user%admin%');

-- ⚠ SLOWER: Three-part pattern
SELECT * FROM biscuit_match('%a%b%c%');

-- ✗ AVOID: Excessive parts (4+)
SELECT * FROM biscuit_match('%a%b%c%d%e%');

Pattern Complexity Impact:

Parts    Query Time    Recommendation
-----    ----------    --------------
1        ~2ms          Optimal
2        ~5ms          Excellent
3        ~12ms         Good
4        ~25ms         Acceptable
5+       ~50ms+        Avoid if possible

2. Query Structure Optimization

Use Count for Existence Checks:

-- ✗ AVOID: Materializing entire result set
SELECT COUNT(*) FROM biscuit_match('%pattern%');

-- ✓ BETTER: Direct count
SELECT biscuit_match_count('%pattern%');

Apply Filters After Pattern Match:

-- ✓ OPTIMAL: Pattern match first, then filter
SELECT * FROM biscuit_match('%@gmail.com%')
WHERE created_at > '2024-01-01' 
AND status = 'active';

-- ✗ SUBOPTIMAL: Complex joins before pattern match
SELECT * 
FROM complex_view v
WHERE v.email LIKE '%@gmail.com%';  -- Falls back to sequential scan

Leverage Result Reuse:

-- ✓ OPTIMAL: Use CTE for multiple operations on same pattern
WITH matches AS (
    SELECT * FROM biscuit_match('%search%')
)
SELECT 
    (SELECT COUNT(*) FROM matches) AS total,
    (SELECT COUNT(*) FROM matches WHERE category = 'A') AS category_a,
    (SELECT AVG(price) FROM matches) AS avg_price;

3. Index Selectivity Considerations

High Selectivity (Few Matches):

-- Excellent: ~0.1% selectivity
SELECT * FROM biscuit_match('unique_prefix_%');

Low Selectivity (Many Matches):

-- Consider adding filters or reformulating query
SELECT * FROM biscuit_match('%')  -- Returns everything
WHERE additional_criteria;

Selectivity Guidelines:

Match %       Query Performance    Recommendation
-------       -----------------    --------------
<1%           Excellent            Ideal use case
1-10%         Very Good            Well-suited
10-30%        Good                 Acceptable
30-50%        Fair                 Add filters
>50%          Poor                 Reconsider approach
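
Selectivity for a candidate pattern can be estimated cheaply before running the full query, since biscuit_match_count() avoids tuple materialization:

SELECT ROUND(
    100.0 * biscuit_match_count('%pattern%')
          / NULLIF(biscuit_get_active_count(), 0), 2
) AS match_pct;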

CRUD Optimization

1. Bulk Insert Strategy

Small Batches (<1000 rows):

-- Keep triggers enabled
INSERT INTO products (id, name) VALUES
    (1, 'Product 1'),
    (2, 'Product 2'),
    ...
    (1000, 'Product 1000');
-- Index updates happen automatically

Large Batches (>10,000 rows):

-- Disable triggers, bulk insert, rebuild
SELECT biscuit_disable_triggers();

COPY products FROM '/data/products.csv' CSV;
-- or
INSERT INTO products SELECT * FROM staging_table;

SELECT biscuit_build_index('products', 'name', 'id');
SELECT biscuit_enable_triggers();

Performance Comparison:

10,000 records:
  With triggers:    ~2 seconds
  Disable/rebuild:  ~1 second (50% faster)

100,000 records:
  With triggers:    ~20 seconds  
  Disable/rebuild:  ~8 seconds (60% faster)

1,000,000 records:
  With triggers:    ~220 seconds
  Disable/rebuild:  ~85 seconds (61% faster)

2. Update Optimization

Maximize Incremental Updates:

-- ✓ GOOD: Same-length updates (incremental)
UPDATE users SET email = 'newemail@example.com'  -- 21 chars
WHERE email = 'oldemail@example.com';             -- 21 chars

-- ⚠ ACCEPTABLE: Different-length (full reindex)
UPDATE users SET email = 'verylongemail@example.com'  -- 27 chars
WHERE email = 'short@ex.co';                          -- 11 chars

Batch Updates:

-- For large update batches
SELECT biscuit_disable_triggers();

UPDATE products SET name = UPPER(name);

SELECT biscuit_build_index('products', 'name', 'id');
SELECT biscuit_enable_triggers();

3. Delete Strategy

Individual Deletes:

-- Lazy deletion is O(1), no optimization needed
DELETE FROM products WHERE id = 123;

Bulk Deletes:

-- Option 1: Let lazy deletion handle (recommended for <10K deletes)
DELETE FROM products WHERE category = 'obsolete';

-- Option 2: Disable/rebuild (for >100K deletes)
SELECT biscuit_disable_triggers();
DELETE FROM products WHERE created_at < '2020-01-01';
SELECT biscuit_build_index('products', 'name', 'id');
SELECT biscuit_enable_triggers();

Cleanup Scheduling:

-- Schedule periodic cleanup during low-traffic hours
-- Example: Every night at 2 AM via pg_cron
SELECT cron.schedule('biscuit-cleanup', '0 2 * * *', 
    'SELECT biscuit_cleanup()');

Memory Optimization

1. Monitor Memory Growth

-- Track memory over time
SELECT 
    NOW() AS checked_at,
    biscuit_get_active_count() AS records,
    (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb,
    ROUND((regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC 
          / NULLIF(biscuit_get_active_count(), 0) * 1024, 2) AS kb_per_record;

2. Memory Reduction Strategies

Rebuild Compaction:

-- Periodic rebuild to compact fragmentation
SELECT biscuit_build_index('table', 'column', 'pk');

Character Diversity Reduction (Application Level):

-- Normalize data before indexing
ALTER TABLE users ADD COLUMN email_normalized TEXT;
UPDATE users SET email_normalized = LOWER(email);
SELECT biscuit_setup('users', 'email_normalized', 'id');

Maintenance Optimization

1. Tombstone Management

Adaptive Threshold Tuning:

-- Current threshold: 1000 (default)
-- Adjust based on workload:

-- Read-heavy (Q/D ratio > 20): Increase threshold
-- Write-heavy (Q/D ratio < 10): Decrease threshold
-- Balanced: Keep default

-- Manual threshold adjustment (via C code recompilation)
#define TOMBSTONE_CLEANUP_THRESHOLD 5000  // Adjust value

Proactive Cleanup:

-- Before performance-critical operations
SELECT biscuit_cleanup();

-- Before maintenance windows
SELECT biscuit_cleanup();

2. Index Rebuild Strategy

When to Rebuild:

  • After bulk operations (>10% of table size)
  • Memory usage exceeds expectations by >50%
  • After schema changes
  • Periodic maintenance (monthly/quarterly)

Rebuild Procedure:

BEGIN;
    SELECT biscuit_disable_triggers();
    -- Perform any table maintenance
    SELECT biscuit_build_index('table', 'column', 'pk');
    SELECT biscuit_enable_triggers();
COMMIT;

Concurrency Optimization

1. Read Concurrency

Queries are lock-free:

-- Multiple concurrent queries: No contention
SELECT * FROM biscuit_match('%pattern1%');  -- Session 1
SELECT * FROM biscuit_match('%pattern2%');  -- Session 2
SELECT * FROM biscuit_match('%pattern3%');  -- Session 3
-- All execute concurrently without blocking

2. Write Concurrency

Minimize trigger overhead:

-- ✓ OPTIMAL: Batch small inserts
INSERT INTO products SELECT * FROM new_products_batch;

-- ⚠ SUBOPTIMAL: Individual inserts in loop
-- Each insert triggers index update

Transaction Batching:

-- Group related operations
BEGIN;
    INSERT INTO products VALUES (...);
    UPDATE products SET name = ... WHERE id = ...;
    DELETE FROM products WHERE id = ...;
COMMIT;
-- All index updates happen at commit

Performance Testing

Benchmark Query Performance

-- Timing template
\timing on
SELECT * FROM biscuit_match('%pattern%');
\timing off

-- With EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM biscuit_match('%pattern%');

Compare with Traditional Methods

-- BISCUIT
\timing on
SELECT COUNT(*) FROM biscuit_match('%search%');
\timing off

-- Traditional LIKE
\timing on
SELECT COUNT(*) FROM table WHERE column LIKE '%search%';
\timing off

-- GIN Index
\timing on  
SELECT COUNT(*) FROM table WHERE column ILIKE '%search%';
\timing off
Limitations and Constraints

Unsupported Pattern Syntax

BISCUIT supports only the LIKE-style wildcards % and _. Regular-expression syntax and alternation are not supported:

-- ✗ NOT SUPPORTED
SELECT * FROM biscuit_match('[0-9]+');
SELECT * FROM biscuit_match('(abc|xyz)');

-- ✓ EQUIVALENT WORKAROUNDS
SELECT * FROM biscuit_match('abc%') 
UNION
SELECT * FROM biscuit_match('xyz%');

Architectural Constraints

1. Memory Residency

Index Location: Shared memory (non-persistent)

Implications:

  • Index lost on server restart
  • Must rebuild after restart
  • No automatic persistence

Startup Procedure:

-- Add to database initialization script
\c your_database
SELECT biscuit_setup('table1', 'column1', 'pk1');
SELECT biscuit_setup('table2', 'column2', 'pk2');

2. Single Index Instance

Limitation: Only one BISCUIT index per PostgreSQL instance

Implications:

-- ✗ Cannot maintain multiple indexes simultaneously
SELECT biscuit_setup('table1', 'col1', 'id');
SELECT biscuit_setup('table2', 'col2', 'id');  -- Replaces table1 index

-- ✓ Workaround: Separate databases
CREATE DATABASE db1;
\c db1
SELECT biscuit_setup('table', 'column', 'id');

CREATE DATABASE db2;
\c db2
SELECT biscuit_setup('table', 'column', 'id');

3. Transaction Isolation

Current Behavior: Index updates commit with transaction

Limitation: No snapshot isolation for queries

BEGIN;
    INSERT INTO products VALUES (1, 'New Product');
    -- Index updated in current transaction
    SELECT * FROM biscuit_match('%New%');  -- Sees new product
ROLLBACK;

-- Index reverted (product removed)
SELECT * FROM biscuit_match('%New%');  -- No longer sees product

Note: This is expected behavior and consistent with PostgreSQL’s MVCC model at the trigger level.

4. Single Column Per Index

Current Design: One index per table, one column per index

-- ✗ Cannot index multiple columns simultaneously
-- Must choose one column

-- ✓ Workaround: Concatenated column
ALTER TABLE products ADD COLUMN search_text TEXT
    GENERATED ALWAYS AS (name || ' ' || description || ' ' || sku) STORED;
SELECT biscuit_setup('products', 'search_text', 'id');

5. NULL Handling

Behavior: NULL values are skipped during indexing (see NULL Handling above)

INSERT INTO products (id, name) VALUES (1, NULL);

-- Row 1 is never indexed
SELECT * FROM biscuit_match('%');     -- Does not include row 1
SELECT biscuit_match_count('');       -- Does not count row 1

Performance Limitations

1. Pattern Complexity Scaling

Complexity: O(k × m), k = pattern parts, m = search window

-- Linear growth in query time with pattern parts
SELECT * FROM biscuit_match('%a%');              -- ~2ms
SELECT * FROM biscuit_match('%a%b%');            -- ~5ms
SELECT * FROM biscuit_match('%a%b%c%');          -- ~12ms
SELECT * FROM biscuit_match('%a%b%c%d%');        -- ~25ms
SELECT * FROM biscuit_match('%a%b%c%d%e%');      -- ~52ms

Recommendation: Limit patterns to 3-4 parts for optimal performance

2. Write Concurrency

Limitation: Trigger-based updates serialize at row level

Impact:

High-concurrency writes (>10K/sec):
  May experience contention
  Trigger overhead accumulates
  
Solution:
  Batch writes
  Disable triggers for bulk operations
  Consider application-level queue

3. Memory Growth

Behavior: Memory grows with:

  • Record count
  • String length diversity
  • Character diversity
  • Tombstone accumulation

Typical memory bounds:

Minimum: ~10% of indexed data size
Maximum: ~50% of indexed data size
Typical: 20-25% of indexed data size

Mitigation:

-- Periodic rebuild for compaction
SELECT biscuit_build_index('table', 'column', 'pk');

-- Cleanup tombstones
SELECT biscuit_cleanup();

Compatibility Constraints

1. PostgreSQL Version

Minimum: PostgreSQL 11.0

Tested Versions:

  • PostgreSQL 11.x: ✓ Supported
  • PostgreSQL 12.x: ✓ Supported
  • PostgreSQL 13.x: ✓ Supported
  • PostgreSQL 14.x: ✓ Supported
  • PostgreSQL 15.x: ✓ Supported
  • PostgreSQL 16.x: ✓ Supported (expected)

2. Data Type Support

Primary Key Types:

  • Fully Supported: INT2, INT4, INT8, TEXT, VARCHAR, CHAR, UUID
  • Generic Support: All types with output functions
  • Not Supported: Composite types, arrays (without custom casting)

Indexed Column Types:

  • Required: TEXT, VARCHAR, CHAR, or castable to TEXT
  • Not Supported: Binary types (BYTEA), JSON, arrays without casting
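
For columns outside these types, a STORED generated column can expose a TEXT projection that BISCUIT can index. A sketch for a JSONB field (table and field names are hypothetical):

-- Materialize one JSONB field as TEXT, then index the projection
ALTER TABLE events ADD COLUMN message_text TEXT
    GENERATED ALWAYS AS (payload->>'message') STORED;

SELECT biscuit_setup('events', 'message_text', 'id');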

3. Extension Dependencies

Required:

  • PostgreSQL server-side C API
  • Standard PostgreSQL libraries

Optional:

  • CRoaring library (for optimized bitmaps)
  • Falls back to built-in implementation if unavailable

Security Considerations

1. Permissions

Required Privileges:

-- Index creation requires
GRANT CREATE ON DATABASE your_database TO app_user; -- For extension
GRANT ALL ON TABLE target_table TO app_user;        -- For triggers

-- Query execution requires
GRANT SELECT ON TABLE target_table TO app_user;

2. SQL Injection

Safe Query Patterns:

-- ✓ SAFE: Parameterized queries
PREPARE search_stmt AS 
    SELECT * FROM biscuit_match($1);
EXECUTE search_stmt('%user_input%');

-- ✓ SAFER: Parameterize even after sanitizing; sanitization alone is fragile
pattern = sanitize_input(user_input)
cursor.execute("SELECT * FROM biscuit_match(%s)", (pattern,))

Unsafe Patterns:

-- ✗ UNSAFE: Interpolating raw user input into the SQL string
query = f"SELECT * FROM biscuit_match('%{user_input}%')"
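
Note that parameterization prevents SQL injection but not wildcard injection: user-supplied % characters still change pattern semantics. One option is to strip them server-side before composing the pattern; a sketch assuming % is the only wildcard BISCUIT interprets:

-- Treat user input as a literal by removing pattern metacharacters
PREPARE literal_search(TEXT) AS
    SELECT * FROM biscuit_match('%' || replace($1, '%', '') || '%');
EXECUTE literal_search('50% off');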

3. Resource Exhaustion

Potential Vectors:

  • Complex patterns (>5 parts): High CPU usage
  • Very long patterns: Memory allocation
  • Rapid index rebuilds: Memory fragmentation

Mitigation:

-- Rate limit index rebuilds
-- Validate pattern complexity before execution
-- Set statement timeout
SET statement_timeout = '5s';
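
The statement timeout is a backstop; a complexity guard can reject expensive patterns before they reach the index. A minimal sketch (function name and threshold are illustrative):

CREATE OR REPLACE FUNCTION safe_biscuit_count(p TEXT)
RETURNS BIGINT AS $$
BEGIN
    -- Count % separators; more than 5 implies more than 4 literal parts
    IF length(p) - length(replace(p, '%', '')) > 5 THEN
        RAISE EXCEPTION 'Pattern too complex: %', p;
    END IF;
    RETURN biscuit_match_count(p);
END;
$$ LANGUAGE plpgsql;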

Known Issues

1. Index Persistence

Issue: Index not persistent across restarts

Status: Design limitation (in-memory structure)

Workaround: Rebuild on startup via initialization script

2. Multiple Index Support

Issue: Only one index per PostgreSQL instance

Status: Architectural limitation

Workaround: Use multiple databases or implement queuing

3. Incremental Update Threshold

Issue: Updates with >20% character changes trigger full reindex

Status: Configurable threshold (requires C code modification)

Workaround: Design application to minimize large updates
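
As a hypothetical illustration of the threshold: an edit touching a few characters takes the incremental path, while rewriting most of the string pays the full per-row reindex cost.

-- Incremental path: only a few positions change
UPDATE products SET name = 'Dell XPS 15 Laptop (2024)' WHERE id = 1;

-- Full-reindex path: well over 20% of characters differ
UPDATE products SET name = 'Refurbished Mobile Workstation' WHERE id = 1;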

Future Enhancements (Roadmap)

Planned:

  • Persistent index storage
  • Multiple concurrent indexes
  • Case-insensitive mode option
  • Configurable position limit (>256 chars)
  • Pattern query optimizer
  • Index statistics collection

Under Consideration:

  • Fuzzy matching support
  • Phonetic matching
  • Unicode normalization
  • Multi-column composite indexes
  • Parallel query execution
  • Incremental checkpoint/restore

Troubleshooting

Common Issues and Solutions

Issue 1: “Index not built” Error

Symptom:

SELECT * FROM biscuit_match('%pattern%');
ERROR: Index not built. Call biscuit_build_index() first.

Cause: Index not initialized or server restarted

Solution:

-- Option 1: Quick setup
SELECT biscuit_setup('table_name', 'column_name', 'pk_column');

-- Option 2: Manual build
SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');
SELECT biscuit_create_match_function();
SELECT biscuit_enable_triggers();

Prevention:

-- Add to database startup script
-- File: /etc/postgresql/14/main/init_biscuit.sql
\c your_database
SELECT biscuit_setup('table1', 'column1', 'pk1');

Issue 2: Query Returns No Results

Symptom:

SELECT biscuit_match_count('%known_value%');
-- Returns: 0 (expected >0)

Diagnosis:

-- Check if value exists in table
SELECT COUNT(*) FROM table_name WHERE column_name LIKE '%known_value%';

-- Check index status
SELECT biscuit_index_status();

-- Check trigger status
SELECT tgname, tgenabled FROM pg_trigger WHERE tgname = 'biscuit_auto_update';

Possible Causes and Solutions:

A. Index Out of Sync

-- Rebuild index
SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');

B. Case Sensitivity

-- Pattern is case-sensitive
SELECT * FROM biscuit_match('%VALUE%');  -- Won't match "value"

-- Solution: index a normalized (lowercased) column and lowercase the pattern
SELECT * FROM biscuit_match(LOWER('%VALUE%'));

C. Pattern Beyond 256 Characters

-- Match occurs after position 256
SELECT LENGTH(column_name) FROM table_name WHERE id = 123;
-- Returns: 280

-- Solution: Pattern must be within first 256 chars
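
If matches past position 256 matter, one hedged workaround is to let BISCUIT cover the indexed window and verify longer rows with LIKE:

-- BISCUIT covers the first 256 characters; LIKE verifies longer rows
SELECT * FROM biscuit_match('%known_value%')
UNION
SELECT * FROM table_name
WHERE LENGTH(column_name) > 256
  AND column_name LIKE '%known_value%';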

Issue 3: Slow Query Performance

Symptom:

\timing on
SELECT * FROM biscuit_match('%pattern%');
Time: 850.234 ms
-- Expected: <10ms

Diagnosis:

-- Check tombstone count
SELECT biscuit_get_tombstone_count();
-- If >5000: Cleanup needed

-- Check pattern complexity (count % symbols)
-- Note: regexp_count() requires PostgreSQL 15+; on older versions use
--   length(pattern) - length(replace(pattern, '%', ''))
SELECT regexp_count('%a%b%c%d%', '%');
-- If >4: Pattern too complex

-- Check memory usage
SELECT biscuit_index_status();

Solutions:

A. High Tombstone Count

SELECT biscuit_cleanup();
-- Should restore normal performance

B. Complex Pattern

-- Simplify pattern
-- From: SELECT * FROM biscuit_match('%a%b%c%d%e%');
-- To:   SELECT * FROM biscuit_match('%abc%de%');

C. Large Result Set

-- Check selectivity
SELECT biscuit_match_count('%pattern%');
-- If >50% of table: Not ideal for index

-- Solution: Add additional filters
SELECT * FROM biscuit_match('%pattern%')
WHERE created_at > '2024-01-01';

Issue 4: High Memory Usage

Symptom:

SELECT biscuit_index_status();
-- Memory: 500 MB (expected ~20 MB for dataset)

Diagnosis:

-- Check record count
SELECT biscuit_get_active_count();

-- Check string characteristics
SELECT 
    COUNT(*) AS records,
    AVG(LENGTH(column_name)) AS avg_length,
    MAX(LENGTH(column_name)) AS max_length,
    COUNT(DISTINCT column_name) AS unique_values
FROM table_name;

-- Check tombstone accumulation
SELECT biscuit_get_tombstone_count();

Solutions:

A. Tombstone Accumulation

SELECT biscuit_cleanup();
-- Then check memory again

B. Memory Fragmentation

-- Rebuild to compact
SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');

C. Unexpected Data Characteristics

-- Very long strings or high diversity
-- Consider normalizing data
UPDATE table_name SET column_name = LEFT(column_name, 200);
SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');

Issue 5: Trigger Not Firing

Symptom:

INSERT INTO table_name (id, column_name) VALUES (999, 'test');
SELECT * FROM biscuit_match('%test%');
-- Returns: no rows (expected 1 row)

Diagnosis:

-- Check trigger existence
SELECT tgname, tgenabled, tgisinternal
FROM pg_trigger
WHERE tgrelid = 'table_name'::regclass
AND tgname = 'biscuit_auto_update';

-- Check for errors in logs
SHOW log_destination;
-- Review PostgreSQL logs for trigger errors

Solutions:

A. Trigger Disabled

SELECT biscuit_enable_triggers();

B. Trigger Missing

SELECT biscuit_setup('table_name', 'column_name', 'pk_column');
-- Or manually
SELECT biscuit_enable_triggers();

C. Trigger Error

-- Check PostgreSQL logs
-- Common causes:
-- - NULL primary key
-- - Type conversion error
-- - Memory allocation failure

-- Rebuild to reset
SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');
SELECT biscuit_enable_triggers();

Issue 6: “NULL primary key” Error

Symptom:

INSERT INTO table_name (column_name) VALUES ('test');
ERROR: NULL primary key in INSERT

Cause: Primary key column contains NULL

Solution:

A. Use Serial/Identity Column

ALTER TABLE table_name ALTER COLUMN id ADD GENERATED ALWAYS AS IDENTITY;

B. Set Default Value

ALTER TABLE table_name ALTER COLUMN id SET DEFAULT nextval('table_name_id_seq');

C. Ensure Non-NULL Constraint

ALTER TABLE table_name ALTER COLUMN id SET NOT NULL;

Issue 7: Out of Memory

Symptom:

SELECT biscuit_build_index('large_table', 'text_column', 'id');
ERROR: out of memory

Diagnosis:

-- Check table size
SELECT 
    pg_size_pretty(pg_total_relation_size('large_table')) AS table_size,
    COUNT(*) AS row_count,
    AVG(LENGTH(text_column)) AS avg_length
FROM large_table;

-- Check available memory
SHOW shared_buffers;
SHOW work_mem;

Solutions:

A. Increase Memory Allocation

# In postgresql.conf
shared_buffers = 2GB       # Increase from default
work_mem = 256MB           # Increase from default

# Restart PostgreSQL
sudo systemctl restart postgresql

B. Partition Table

-- Create partitioned table
CREATE TABLE large_table_partitioned (
    id SERIAL PRIMARY KEY,
    text_column TEXT,
    created_at DATE
) PARTITION BY RANGE (created_at);

-- Create partitions
CREATE TABLE large_table_2023 PARTITION OF large_table_partitioned
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- Index each partition separately
SELECT biscuit_setup('large_table_2023', 'text_column', 'id');

C. Reduce String Diversity

-- Normalize before indexing
UPDATE large_table SET text_column = LOWER(text_column);
SELECT biscuit_build_index('large_table', 'text_column', 'id');

Issue 8: Index-Table Mismatch

Symptom:

SELECT COUNT(*) FROM table_name;
-- Returns: 100000

SELECT biscuit_get_active_count();
-- Returns: 95000 (mismatch!)

Diagnosis:

-- Check trigger status
SELECT tgenabled FROM pg_trigger WHERE tgname = 'biscuit_auto_update';

-- Check for errors
SELECT biscuit_index_status();

-- Manual comparison
WITH table_data AS (
    SELECT COUNT(*) AS cnt FROM table_name
),
index_data AS (
    SELECT biscuit_get_active_count() AS cnt
)
SELECT 
    t.cnt AS table_count,
    i.cnt AS index_count,
    t.cnt - i.cnt AS discrepancy
FROM table_data t, index_data i;

Solutions:

A. Rebuild Index

SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');
SELECT biscuit_enable_triggers();

B. Check for Concurrent Modifications

-- Ensure exclusive access during rebuild
BEGIN;
    LOCK TABLE table_name IN EXCLUSIVE MODE;
    SELECT biscuit_build_index('table_name', 'column_name', 'pk_column');
COMMIT;

Diagnostic Queries

Comprehensive Health Check

DO $$
DECLARE
    table_count BIGINT;
    index_count INTEGER;
    tombstone_count INTEGER;
    free_slots INTEGER;
    memory_mb NUMERIC;
BEGIN
    EXECUTE 'SELECT COUNT(*) FROM your_table' INTO table_count;
    index_count := biscuit_get_active_count();
    tombstone_count := biscuit_get_tombstone_count();
    free_slots := biscuit_get_free_slots();
    memory_mb := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC;
    
    RAISE NOTICE 'BISCUIT Health Check:';
    RAISE NOTICE '  Table rows: %', table_count;
    RAISE NOTICE '  Index active: %', index_count;
    RAISE NOTICE '  Discrepancy: %', table_count - index_count;
    RAISE NOTICE '  Tombstones: %', tombstone_count;
    RAISE NOTICE '  Free slots: %', free_slots;
    RAISE NOTICE '  Memory: % MB', memory_mb;
    
    IF table_count != index_count THEN
        RAISE WARNING 'Index out of sync! Rebuild recommended.';
    END IF;
    
    IF tombstone_count > 5000 THEN
        RAISE WARNING 'High tombstone count! Cleanup recommended.';
    END IF;
    
    IF memory_mb > (index_count::NUMERIC / 1000000 * 30) THEN
        RAISE WARNING 'High memory usage! Check for issues.';
    END IF;
END $$;

Performance Baseline

-- Establish performance baseline
\timing on
SELECT biscuit_match_count('%');              -- Baseline: all records
SELECT biscuit_match_count('%common_char%');  -- Baseline: common pattern
SELECT biscuit_match_count('prefix%');        -- Baseline: prefix
SELECT biscuit_match_count('%suffix');        -- Baseline: suffix
\timing off

-- Save baselines and compare periodically
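
A small log table makes the periodic comparison concrete; a sketch (the table name is illustrative):

CREATE TABLE IF NOT EXISTS biscuit_baseline (
    captured_at TIMESTAMPTZ DEFAULT NOW(),
    pattern TEXT,
    match_count BIGINT
);

-- Capture counts for the reference patterns above
INSERT INTO biscuit_baseline (pattern, match_count)
VALUES ('%',       biscuit_match_count('%')),
       ('prefix%', biscuit_match_count('prefix%')),
       ('%suffix', biscuit_match_count('%suffix'));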

Memory Leak Detection

-- Monitor memory over operations
SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb_before;

-- Perform operations
INSERT INTO table_name SELECT generate_series(1, 10000), 'test' || generate_series(1, 10000);
DELETE FROM table_name WHERE id > 50000;
SELECT biscuit_cleanup();

-- Check memory after
SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb_after;

-- Memory should be stable or decrease after cleanup

Error Code Reference

| Error Message | Cause | Solution |
|---------------|-------|----------|
| Index not built | No index initialized | Run biscuit_setup() |
| NULL primary key | PK value is NULL | Ensure PK constraint |
| Column not found | Invalid column name | Verify table schema |
| Insufficient permissions | User lacks privileges | Grant table permissions |
| Out of memory | Insufficient RAM | Increase shared_buffers |
| Table not found | Invalid table name | Check table existence |
| Index already exists | Duplicate index | Drop and rebuild |
| Trigger already exists | Duplicate trigger | Use biscuit_disable_triggers() first |

Getting Help

Before Seeking Support:

  1. Run comprehensive health check (above)
  2. Review PostgreSQL logs for errors
  3. Check extension version: SELECT biscuit_version()
  4. Verify PostgreSQL version: SELECT version()
  5. Document reproduction steps

Support Channels:

  • GitHub Issues: https://github.com/crystallinecore/biscuit/issues
  • GitHub Discussions: https://github.com/crystallinecore/biscuit/discussions
  • Documentation: [Online Docs URL]

Information to Provide:

  • Extension version
  • PostgreSQL version
  • Operating system
  • Table schema
  • Reproduction steps
  • Error messages
  • Output of SELECT biscuit_index_status()

Examples and Use Cases

Example 1: E-Commerce Product Search

Scenario: Product catalog with 5M products, need fast SKU and name searches

Schema:

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(50) NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    category VARCHAR(100),
    price NUMERIC(10, 2),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Sample data
INSERT INTO products (sku, name, category, price) VALUES
    ('ELEC-LAP-001', 'Dell XPS 15 Laptop', 'Electronics', 1299.99),
    ('ELEC-LAP-002', 'MacBook Pro 16"', 'Electronics', 2499.99),
    ('HOME-FURN-001', 'Modern Office Chair', 'Furniture', 349.99),
    ('HOME-FURN-002', 'Standing Desk', 'Furniture', 599.99);

BISCUIT Setup:

-- Index product names
SELECT biscuit_setup('products', 'name', 'id');

Query Examples:

-- Search by product name
SELECT * FROM biscuit_match('%Laptop%');

-- Case-insensitive search (with normalized column)
-- Note: re-running biscuit_setup() replaces the existing index on name
-- (single-index limitation)
ALTER TABLE products ADD COLUMN name_lower TEXT 
    GENERATED ALWAYS AS (LOWER(name)) STORED;
SELECT biscuit_setup('products', 'name_lower', 'id');
SELECT * FROM biscuit_match('%laptop%');

-- Complex search
SELECT * FROM biscuit_match('%office%chair%')
WHERE price BETWEEN 200 AND 500;

-- Count matches
SELECT biscuit_match_count('%macbook%');

-- Top sellers matching pattern
SELECT p.*, s.sales_count
FROM biscuit_match('%laptop%') p
JOIN sales_summary s ON p.id = s.product_id
ORDER BY s.sales_count DESC
LIMIT 10;

Performance Comparison:

-- Traditional LIKE (sequential scan)
\timing on
SELECT COUNT(*) FROM products WHERE name LIKE '%Laptop%';
Time: 1850 ms

-- BISCUIT
SELECT biscuit_match_count('%Laptop%');
Time: 3.2 ms

-- Speedup: 578×

Example 2: Customer Email Domain Analysis

Scenario: Analyze customer email patterns, identify providers, find similar domains

Schema:

CREATE TABLE customers (
    customer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email TEXT NOT NULL,
    first_name TEXT,
    last_name TEXT,
    signup_date DATE DEFAULT CURRENT_DATE,
    status VARCHAR(20) DEFAULT 'active'
);

-- Sample data
INSERT INTO customers (email, first_name, last_name) VALUES
    ('john.doe@gmail.com', 'John', 'Doe'),
    ('jane.smith@yahoo.com', 'Jane', 'Smith'),
    ('bob.wilson@company.com', 'Bob', 'Wilson'),
    ('alice.brown@gmail.com', 'Alice', 'Brown');

BISCUIT Setup:

-- Note: UUID primary key fully supported
SELECT biscuit_setup('customers', 'email', 'customer_id');

Query Examples:

-- Find all Gmail users
SELECT * FROM biscuit_match('%@gmail.com');

-- Find all corporate emails (non-freemail)
SELECT * 
FROM biscuit_match('%@%.com')
WHERE email NOT LIKE '%@gmail.com'
  AND email NOT LIKE '%@yahoo.com'
  AND email NOT LIKE '%@hotmail.com';

-- Domain distribution analysis
WITH email_domains AS (
    SELECT 
        customer_id,
        email,
        SUBSTRING(email FROM '@(.*)$') AS domain
    FROM biscuit_match('%')
)
SELECT 
    domain,
    COUNT(*) AS customer_count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM email_domains
GROUP BY domain
ORDER BY customer_count DESC
LIMIT 10;

-- Find emails with specific patterns
SELECT * FROM biscuit_match('%.doe@%');      -- All .doe emails
SELECT * FROM biscuit_match('admin%@%');     -- Starts with admin
SELECT * FROM biscuit_match('%+%@%');        -- Contains plus addressing

Business Intelligence Query:

-- Signup trends by email provider
WITH gmail_users AS (
    SELECT signup_date 
    FROM biscuit_match('%@gmail.com')
),
yahoo_users AS (
    SELECT signup_date 
    FROM biscuit_match('%@yahoo.com')
)
SELECT 
    DATE_TRUNC('month', signup_date) AS month,
    'Gmail' AS provider,
    COUNT(*) AS signups
FROM gmail_users
GROUP BY 1

UNION ALL

SELECT 
    DATE_TRUNC('month', signup_date) AS month,
    'Yahoo' AS provider,
    COUNT(*) AS signups
FROM yahoo_users
GROUP BY 1

ORDER BY month DESC, provider;

Example 3: Log Analysis and Error Detection

Scenario: 50M log entries, need to quickly find error patterns and anomalies

Schema:

CREATE TABLE application_logs (
    log_id BIGSERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ DEFAULT NOW(),
    level VARCHAR(10),
    message TEXT,
    service VARCHAR(50),
    user_id INTEGER
);

-- Sample data
INSERT INTO application_logs (level, message, service, user_id) VALUES
    ('INFO', 'User login successful', 'auth', 12345),
    ('ERROR', 'Database connection timeout', 'api', 67890),
    ('WARN', 'Rate limit approaching', 'api', 12345),
    ('ERROR', 'Failed to connect to payment gateway', 'payments', 11111);

BISCUIT Setup:

SELECT biscuit_setup('application_logs', 'message', 'log_id');

Query Examples:

-- Find all connection errors
SELECT * FROM biscuit_match('%connection%')
WHERE level = 'ERROR'
ORDER BY timestamp DESC
LIMIT 100;

-- Find timeout issues across all services
SELECT 
    service,
    COUNT(*) AS timeout_count,
    MIN(timestamp) AS first_seen,
    MAX(timestamp) AS last_seen
FROM biscuit_match('%timeout%')
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY service
ORDER BY timeout_count DESC;

-- Complex error pattern
SELECT * FROM biscuit_match('%failed%connect%')
WHERE level IN ('ERROR', 'CRITICAL')
AND timestamp > NOW() - INTERVAL '1 hour'
ORDER BY timestamp DESC;

-- Error correlation
WITH error_patterns AS (
    SELECT 
        log_id,
        message,
        timestamp,
        CASE 
            WHEN message LIKE '%timeout%' THEN 'timeout'
            WHEN message LIKE '%connection%' THEN 'connection'
            WHEN message LIKE '%memory%' THEN 'memory'
            ELSE 'other'
        END AS error_category
    FROM biscuit_match('%error%')
    WHERE timestamp > NOW() - INTERVAL '24 hours'
)
SELECT 
    error_category,
    COUNT(*) AS occurrences,
    COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
FROM error_patterns
GROUP BY error_category
ORDER BY occurrences DESC;

Real-Time Monitoring:

-- Create monitoring view
CREATE OR REPLACE VIEW recent_errors AS
SELECT 
    log_id,
    timestamp,
    level,
    message,
    service,
    user_id
FROM biscuit_match('%error%')
WHERE timestamp > NOW() - INTERVAL '5 minutes'
ORDER BY timestamp DESC;

-- Query view
SELECT * FROM recent_errors;

Example 4: Genetic Sequence Matching

Scenario: Bioinformatics database with DNA sequences, need pattern matching

Schema:

CREATE TABLE dna_sequences (
    sequence_id VARCHAR(50) PRIMARY KEY,
    organism TEXT,
    sequence TEXT,  -- ATCG sequence
    length INTEGER,
    discovered_date DATE
);

-- Sample data
INSERT INTO dna_sequences VALUES
    ('SEQ001', 'E. coli', 'ATCGATCGATCG', 12, '2024-01-15'),
    ('SEQ002', 'Human', 'GCTAGCTAGCTA', 12, '2024-01-16'),
    ('SEQ003', 'E. coli', 'ATCGGGCCCATCG', 13, '2024-01-17');

BISCUIT Setup:

SELECT biscuit_setup('dna_sequences', 'sequence', 'sequence_id');

Query Examples:

-- Find sequences with specific motif
SELECT * FROM biscuit_match('%ATCG%');

-- Find palindromic sites (GATC is its own reverse complement)
SELECT * FROM biscuit_match('%GATC%');

-- Complex pattern matching
SELECT * FROM biscuit_match('%ATC%GGG%')
WHERE organism = 'E. coli';

-- Count sequences by pattern
SELECT 
    'Contains ATCG' AS pattern,
    biscuit_match_count('%ATCG%') AS count
UNION ALL
SELECT 
    'Starts with GC',
    biscuit_match_count('GC%')
UNION ALL
SELECT 
    'Ends with TA',
    biscuit_match_count('%TA');

Example 5: Social Media Username Search

Scenario: Social platform with 100M users, need fast username search

Schema:

CREATE TABLE users (
    user_id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    display_name TEXT,
    bio TEXT,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    follower_count INTEGER DEFAULT 0
);

-- Sample data
INSERT INTO users (username, display_name, bio, follower_count) VALUES
    ('john_doe_123', 'John Doe', 'Software engineer', 1250),
    ('jane_smith', 'Jane Smith', 'Data scientist', 3400),
    ('tech_guru_99', 'Tech Guru', 'Tech enthusiast', 15000);

BISCUIT Setup:

-- Index lowercase usernames for case-insensitive search
ALTER TABLE users ADD COLUMN username_lower TEXT 
    GENERATED ALWAYS AS (LOWER(username)) STORED;

SELECT biscuit_setup('users', 'username_lower', 'user_id');

Query Examples:

-- Search usernames
SELECT user_id, username, display_name, follower_count
FROM biscuit_match('%tech%')
ORDER BY follower_count DESC;

-- Find available similar usernames
WITH desired_username AS (
    SELECT 'john_doe' AS base
),
existing_matches AS (
    SELECT username 
    FROM biscuit_match('john_doe%')
)
SELECT 
    base || '_' || n AS suggestion
FROM desired_username
CROSS JOIN generate_series(1, 10) AS n
WHERE base || '_' || n NOT IN (
    SELECT username FROM existing_matches
)
LIMIT 5;

-- Find influencers matching pattern
SELECT username, display_name, follower_count
FROM biscuit_match('%guru%')
WHERE follower_count > 10000
ORDER BY follower_count DESC;

-- Username analytics
WITH username_patterns AS (
    SELECT 
        CASE 
            WHEN username LIKE '%\_%' THEN 'has_underscores'
            WHEN username ~ '[0-9]' THEN 'has_numbers'
            ELSE 'alphanumeric'
        END AS pattern_type,
        follower_count
    FROM biscuit_match('%')
)
SELECT 
    pattern_type,
    COUNT(*) AS user_count,
    AVG(follower_count) AS avg_followers
FROM username_patterns
GROUP BY pattern_type;

Example 6: Document Management System

Scenario: Legal document repository, need full-text-like search on titles

Schema:

CREATE TABLE documents (
    doc_id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    document_type VARCHAR(50),
    file_path TEXT,
    uploaded_by INTEGER,
    uploaded_at TIMESTAMPTZ DEFAULT NOW(),
    case_number VARCHAR(50)
);

-- Sample data
INSERT INTO documents (title, document_type, case_number) VALUES
    ('Contract Agreement - Smith vs Johnson', 'Contract', 'CASE-2024-001'),
    ('Patent Application for AI Technology', 'Patent', 'PAT-2024-042'),
    ('Lease Agreement - Commercial Property', 'Lease', 'CASE-2024-002');

BISCUIT Setup:

-- Create searchable column with normalized text
ALTER TABLE documents ADD COLUMN title_search TEXT 
    GENERATED ALWAYS AS (LOWER(REGEXP_REPLACE(title, '[^a-z0-9 ]', '', 'gi'))) STORED;

SELECT biscuit_setup('documents', 'title_search', 'doc_id');

Query Examples:

-- Find contract documents
SELECT * FROM biscuit_match('%contract%');

-- Multi-term search (terms must appear in this order)
SELECT * 
FROM biscuit_match('%lease%commercial%')
WHERE document_type = 'Lease';

-- Case document search
SELECT doc_id, title, case_number, uploaded_at
FROM biscuit_match('%smith%johnson%')
ORDER BY uploaded_at DESC;

-- Document type distribution
WITH doc_matches AS (
    SELECT document_type, COUNT(*) AS cnt
    FROM biscuit_match('%agreement%')
    GROUP BY document_type
)
SELECT 
    document_type,
    cnt,
    ROUND(100.0 * cnt / SUM(cnt) OVER (), 2) AS percentage
FROM doc_matches
ORDER BY cnt DESC;

Example 7: Bulk Data Migration with BISCUIT

Scenario: Migrating 10M records from legacy system

Migration Script:

-- Step 1: Create target table
CREATE TABLE products_new (
    id INTEGER PRIMARY KEY,
    sku TEXT,
    name TEXT,
    description TEXT
);

-- Step 2: Prepare BISCUIT (without triggers for bulk)
SELECT biscuit_build_index('products_new', 'name', 'id');

-- Step 3: Bulk load data
\timing on
SELECT biscuit_disable_triggers();

COPY products_new FROM '/data/products.csv' CSV HEADER;
-- Alternative: INSERT SELECT from staging

Time: 45,000 ms

-- Step 4: Rebuild index
SELECT biscuit_build_index('products_new', 'name', 'id');
Time: 38,000 ms

-- Step 5: Enable triggers for future updates
SELECT biscuit_enable_triggers();
\timing off

-- Step 6: Verify
SELECT 
    COUNT(*) AS table_count,
    biscuit_get_active_count() AS index_count,
    biscuit_get_free_slots() AS free_slots
FROM products_new;

-- Step 7: Test queries
SELECT biscuit_match_count('%');  -- Should match table count

Example 8: Multi-Pattern Search Interface

Scenario: Build API endpoint that accepts multiple search patterns

Application Code (Python + psycopg2):

import psycopg2
import psycopg2.extras
from typing import List, Dict

def multi_pattern_search(patterns: List[str]) -> List[Dict]:
    """
    Search for records matching any of the provided patterns
    """
    conn = psycopg2.connect("dbname=mydb user=myuser")
    # RealDictCursor returns rows as dicts, matching the declared return type
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)

    # Build a UNION query with one parameterized biscuit_match() per pattern
    union_queries = ["SELECT * FROM biscuit_match(%s)"] * len(patterns)
    query = " UNION ".join(union_queries)

    cur.execute(query, patterns)
    results = cur.fetchall()

    cur.close()
    conn.close()

    return results

# Usage
results = multi_pattern_search(
    patterns=['%laptop%', '%macbook%', '%computer%']
)

SQL Equivalent:

-- Search multiple patterns
SELECT * FROM biscuit_match('%laptop%')
UNION
SELECT * FROM biscuit_match('%macbook%')
UNION
SELECT * FROM biscuit_match('%computer%')
ORDER BY id;

-- Or with CTE for clarity
WITH 
pattern1 AS (SELECT * FROM biscuit_match('%laptop%')),
pattern2 AS (SELECT * FROM biscuit_match('%macbook%')),
pattern3 AS (SELECT * FROM biscuit_match('%computer%'))
SELECT * FROM pattern1
UNION
SELECT * FROM pattern2
UNION
SELECT * FROM pattern3;

Example 9: Performance Monitoring Dashboard

Scenario: Real-time monitoring of BISCUIT performance

Monitoring Queries:

-- Create monitoring function
CREATE OR REPLACE FUNCTION biscuit_performance_metrics()
RETURNS TABLE (
    metric TEXT,
    value NUMERIC,
    unit TEXT,
    status TEXT
) AS $$
DECLARE
    active_ct INTEGER;
    tombstone_ct INTEGER;
    free_ct INTEGER;
    memory_val NUMERIC;
    query_ct BIGINT;
    insert_ct BIGINT;
    update_ct BIGINT;
    delete_ct BIGINT;
BEGIN
    active_ct := biscuit_get_active_count();
    tombstone_ct := biscuit_get_tombstone_count();
    free_ct := biscuit_get_free_slots();
    memory_val := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC;
    query_ct := (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT;
    insert_ct := (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT;
    update_ct := (regexp_matches(biscuit_index_status(), 'Updates: ([0-9]+)'))[1]::BIGINT;
    delete_ct := (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT;
    
    RETURN QUERY SELECT 'Active Records'::TEXT, active_ct::NUMERIC, 'records'::TEXT, 
        CASE WHEN active_ct > 0 THEN 'OK' ELSE 'WARNING' END;
    RETURN QUERY SELECT 'Tombstones', tombstone_ct::NUMERIC, 'records', 
        CASE WHEN tombstone_ct < 5000 THEN 'OK' WHEN tombstone_ct < 10000 THEN 'WARNING' ELSE 'CRITICAL' END;
    RETURN QUERY SELECT 'Free Slots', free_ct::NUMERIC, 'slots', 'INFO';
    RETURN QUERY SELECT 'Memory Usage', memory_val, 'MB', 
        CASE WHEN memory_val < 100 THEN 'OK' WHEN memory_val < 500 THEN 'WARNING' ELSE 'HIGH' END;
    RETURN QUERY SELECT 'Total Queries', query_ct::NUMERIC, 'operations', 'INFO';
    RETURN QUERY SELECT 'Total Inserts', insert_ct::NUMERIC, 'operations', 'INFO';
    RETURN QUERY SELECT 'Total Updates', update_ct::NUMERIC, 'operations', 'INFO';
    RETURN QUERY SELECT 'Total Deletes', delete_ct::NUMERIC, 'operations', 'INFO';
END;
$$ LANGUAGE plpgsql;

-- Use in dashboard
SELECT * FROM biscuit_performance_metrics();

Output:

       metric       |  value  |    unit    | status
--------------------+---------+------------+--------
 Active Records     |  998567 | records    | OK
 Tombstones         |     856 | records    | OK
 Free Slots         |    1433 | slots      | INFO
 Memory Usage       |    18.3 | MB         | OK
 Total Queries      |   45892 | operations | INFO
 Total Inserts      | 1000000 | operations | INFO
 Total Updates      |  125678 | operations | INFO
 Total Deletes      |    1433 | operations | INFO

Complete Production Example

Full Setup for Production Environment:

-- 1. Create schema
CREATE SCHEMA production;
SET search_path TO production;

-- 2. Create table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(100) UNIQUE NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    category VARCHAR(100),
    price NUMERIC(12, 2),
    stock_level INTEGER DEFAULT 0,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- 3. Add indexes
CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_products_price ON products(price);

-- 4. Setup BISCUIT
SELECT biscuit_setup('products', 'name', 'id');

-- 5. Verify setup
SELECT biscuit_index_status();

-- 6. Insert sample data
INSERT INTO products (sku, name, category, price, stock_level)
SELECT 
    'SKU-' || g,
    'Product ' || g,
    (ARRAY['Electronics', 'Furniture', 'Clothing'])[floor(random() * 3 + 1)::int],
    (random() * 1000)::NUMERIC(12, 2),
    floor(random() * 100)::INTEGER
FROM generate_series(1, 100000) AS g;

-- 7. Test queries
\timing on
SELECT * FROM biscuit_match('%Product 1234%');
SELECT biscuit_match_count('%Product 1234%');  -- index is on name, not category
\timing off

-- 8. Setup monitoring
CREATE TABLE biscuit_metrics_log (
    logged_at TIMESTAMPTZ DEFAULT NOW(),
    active_count INTEGER,
    tombstone_count INTEGER,
    memory_mb NUMERIC,
    query_count BIGINT
);

-- 9. Schedule monitoring (via pg_cron)
SELECT cron.schedule(
    'biscuit-metrics',
    '*/5 * * * *',  -- Every 5 minutes
    $$INSERT INTO production.biscuit_metrics_log 
      SELECT NOW(), biscuit_get_active_count(), biscuit_get_tombstone_count(),
             (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC,
             (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT$$
);

-- 10. Create API views
CREATE VIEW product_search AS
SELECT id, sku, name, category, price
FROM products
WHERE id IN (SELECT id FROM biscuit_match('%'));  -- Template view
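
The template view above hard-codes its pattern; a thin SQL wrapper makes the pattern a parameter. A sketch (the function name is illustrative):

CREATE OR REPLACE FUNCTION search_products(pattern TEXT)
RETURNS SETOF products AS $$
    SELECT * FROM products
    WHERE id IN (SELECT id FROM biscuit_match(pattern));
$$ LANGUAGE sql;

-- Usage
SELECT * FROM search_products('%laptop%');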

License

PostgreSQL License (similar to MIT/BSD)

Copyright (c) 2024 BISCUIT Contributors

Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following paragraphs appear in all copies.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Version History

v1.0.0

  • Initial release
  • Roaring Bitmap integration
  • Lazy deletion with tombstones
  • Incremental updates
  • O(1) hash table PK lookup
  • Automatic trigger-based CRUD
  • Comprehensive monitoring

v1.0.1

  • Updated README

Contributors

BISCUIT is developed and maintained by Sivaprasad Murali.


Support and Contact

Issues: https://github.com/crystallinecore/biscuit/issues

Discussions: https://github.com/crystallinecore/biscuit/discussions


When pg_trgm feels half-baked, grab a BISCUIT 🍪