Contents
- Biscuit - High-Performance Pattern Matching Index for PostgreSQL
- Installation
- Quick Start
- How It Works
- 12 Performance Optimizations
- 1. Skip Wildcard Intersections
- 2. Early Termination on Empty
- 3. Avoid Redundant Bitmap Copies
- 4. Optimized Single-Part Patterns
- 5. Skip Unnecessary Length Operations
- 6. TID Sorting for Sequential Heap Access
- 7. Batch TID Insertion
- 8. Direct Roaring Iteration
- 9. Batch Cleanup on Threshold
- 10. Aggregate Query Detection
- 11. LIMIT-Aware TID Collection
- 12. Multi-Column Query Optimization
- Benchmarking
- Use Cases
- Configuration
- Limitations and Trade-offs
- Comparison with pg_trgm
- Development
- Architecture Details
- Contributing
- License
- Author
- Acknowledgments
- Support
Biscuit - High-Performance Pattern Matching Index for PostgreSQL
Biscuit is a specialized PostgreSQL index access method (IAM) designed for blazing-fast pattern matching on LIKE queries, with native support for multi-column searches. It eliminates the recheck overhead of trigram indexes while delivering significant performance improvements on wildcard-heavy queries. It stands for Bitmap Indexed Searching with Comprehensive Union and Intersection Techniques.
What’s new?
v2.0.1 (2024-12-06)
- Fixed incorrect results when using multiple
LIKE/NOT LIKEpredicates on the same column. - Root cause: global (not per-predicate) bitmap inversion.
- Fix: correct per-predicate bitmap handling; all multi-filter patterns now return accurate results.
- Added full, efficient support for the
NOT LIKEoperator. - Backward-compatible; recommended to update and re-verify critical queries.
Installation
Requirements
- Build tools:
gcc,make,pg_config - Optional: CRoaring library for enhanced performance
From Source
# Clone repository
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit
# Build and install
make
sudo make install
# Enable in PostgreSQL
psql -d your_database -c "CREATE EXTENSION biscuit;"
From PGXN
pgxn install biscuit
psql -d your_database -c "CREATE EXTENSION biscuit;"
Quick Start
Basic Usage
-- Create a Biscuit index
CREATE INDEX idx_users_name ON users USING biscuit(name);
-- Query with wildcard patterns
SELECT * FROM users WHERE name LIKE '%john%';
SELECT * FROM users WHERE name NOT LIKE 'a%b%c';
SELECT COUNT(*) FROM users WHERE name LIKE '%test%';
Multi-Column Indexes
-- Create multi-column index
CREATE INDEX idx_products_search
ON products USING biscuit(name, description, category);
-- Multi-column query (optimized automatically)
SELECT * FROM products
WHERE name LIKE '%widget%'
AND description LIKE '%blue%'
AND category LIKE 'electronics%'
ORDER BY score DESC
LIMIT 10;
Supported Data Types
Biscuit automatically converts various types to searchable text:
-- Text types (native)
CREATE INDEX ON logs USING biscuit(message);
-- Numeric types (converted to sortable strings)
CREATE INDEX ON events USING biscuit(user_id, event_code);
-- Date/Time types (converted to sortable timestamps)
CREATE INDEX ON orders USING biscuit(order_date, customer_name);
-- Boolean (converted to 't'/'f')
CREATE INDEX ON flags USING biscuit(is_active, status);
How It Works
Core Concept: Bitmap Position Indices
Biscuit builds two types of character-position bitmaps for every string:
1. Positive Indices (Forward)
Tracks which records have character c at position p:
String: "hello"
Bitmaps:
h@0 → {record_ids...}
e@1 → {record_ids...}
l@2 → {record_ids...}
l@3 → {record_ids...}
o@4 → {record_ids...}
2. Negative Indices (Backward)
Tracks which records have character c at position -p from the end:
String: "hello"
Bitmaps:
o@-1 → {record_ids...} (last char)
l@-2 → {record_ids...} (second to last)
l@-3 → {record_ids...}
e@-4 → {record_ids...}
h@-5 → {record_ids...}
3. Length Bitmaps
Two types for fast length filtering:
- Exact length:
length[5]→ all 5-character strings - Minimum length:
length_ge[3]→ all strings ≥ 3 characters
Pattern Matching Algorithm
Example: LIKE '%abc%def'
Step 1: Parse pattern into parts
Parts: ["abc", "def"]
Starts with %: YES
Ends with %: NO
Step 2: Match first part as prefix
-- "abc" must start at position 0
Candidates = pos[a@0] ∩ pos[b@1] ∩ pos[c@2]
Step 3: Match last part at end (negative indexing)
-- "def" must end at string end
Candidates = Candidates ∩ neg[f@-1] ∩ neg[e@-2] ∩ neg[d@-3]
Step 4: Apply length constraint
-- String must be at least 6 chars (abc + def)
Candidates = Candidates ∩ length_ge[6]
Result: Exact matches, zero false positives
Why It’s Fast
1. Pure Bitmap Operations
// Traditional approach (pg_trgm)
for each trigram in pattern:
candidates = scan_trigram_index(trigram)
for each candidate:
if !heap_fetch_and_recheck(candidate): // SLOW: Random I/O
remove candidate
// Biscuit approach
for each character at position:
candidates &= bitmap[char][pos] // FAST: In-memory AND
// No recheck needed!
2. Roaring Bitmaps
Compressed bitmap representation:
- Sparse data: array of integers
- Dense data: bitset
- Automatic conversion for optimal memory
3. Negative Indexing Optimization
-- Pattern: '%xyz'
-- Traditional: Scan all strings, check suffix
-- Biscuit: Direct lookup in neg[z@-1] ∩ neg[y@-2] ∩ neg[x@-3]
12 Performance Optimizations
1. Skip Wildcard Intersections
// Pattern: "a_c" (underscore = any char)
// OLD: Intersect all 256 chars at position 1
// NEW: Skip position 1 entirely, only check a@0 and c@2
2. Early Termination on Empty
result = bitmap[a][0];
result &= bitmap[b][1];
if (result.empty()) return empty; // Don't process remaining chars
3. Avoid Redundant Bitmap Copies
// OLD: Copy bitmap for every operation
// NEW: Operate in-place, copy only when branching
4. Optimized Single-Part Patterns
Fast paths for common cases:
- Exact:
'abc'→ Check position 0-2 and length = 3 - Prefix:
'abc%'→ Check position 0-2 and length ≥ 3 - Suffix:
'%xyz'→ Check negative positions -3 to -1 - Substring:
'%abc%'→ Check all positions, OR results
5. Skip Unnecessary Length Operations
// Pure wildcard patterns
if (pattern == "%%%___%%") // 3 underscores
return length_ge[3]; // No character checks needed!
6. TID Sorting for Sequential Heap Access
// Sort TIDs by (block_number, offset) before returning
// Converts random I/O into sequential I/O
// Uses radix sort for >5000 TIDs, quicksort for smaller sets
7. Batch TID Insertion
// For bitmap scans, insert TIDs in chunks
for (i = 0; i < num_results; i += 10000) {
tbm_add_tuples(tbm, &tids[i], batch_size, false);
}
8. Direct Roaring Iteration
// OLD: Convert bitmap to array, then iterate
// NEW: Direct iterator, no intermediate allocation
roaring_uint32_iterator_t *iter = roaring_create_iterator(bitmap);
while (iter->has_value) {
process(iter->current_value);
roaring_advance_uint32_iterator(iter);
}
9. Batch Cleanup on Threshold
// After 1000 deletes, clean tombstones from all bitmaps
if (tombstone_count >= 1000) {
for each bitmap:
bitmap &= ~tombstones; // Batch operation
tombstones.clear();
}
10. Aggregate Query Detection
// COUNT(*), EXISTS, etc. don't need sorted TIDs
if (!scan->xs_want_itup) {
skip_sorting = true; // Save sorting time
}
11. LIMIT-Aware TID Collection
// If LIMIT 10 in query, don't collect more than needed
if (limit_hint > 0 && collected >= limit_hint)
break; // Early termination
12. Multi-Column Query Optimization
Predicate Reordering
Analyzes each column’s pattern and executes in order of selectivity:
-- Query:
WHERE name LIKE '%common%' -- Low selectivity
AND sku LIKE 'PROD-2024-%' -- High selectivity (prefix)
AND description LIKE '%rare_word%' -- Medium selectivity
-- Execution order (Biscuit automatically reorders):
1. sku LIKE 'PROD-2024-%' (PREFIX, priority=20, selectivity=0.02)
2. description LIKE '%rare_word%' (SUBSTRING, priority=35, selectivity=0.15)
3. name LIKE '%common%' (SUBSTRING, priority=55, selectivity=0.60)
Selectivity scoring formula:
score = 1.0 / (concrete_chars + 1)
- (underscore_count × 0.05)
+ (partition_count × 0.15)
- (anchor_strength / 200)
Priority tiers:
- 0-10: Exact matches, many underscores
- 10-20: Non-% patterns with underscores
- 20-30: Strong anchored patterns (prefix/suffix)
- 30-40: Weak anchored patterns
- 40-50: Multi-partition patterns
- 50-60: Substring patterns (lowest priority)
Benchmarking
Setup Test Data
-- Create 1M row test table
CREATE TABLE benchmark (
id SERIAL PRIMARY KEY,
name TEXT,
description TEXT,
category TEXT,
score FLOAT
);
INSERT INTO benchmark (name, description, category, score)
SELECT
'Name_' || md5(random()::text),
'Description_' || md5(random()::text),
'Category_' || (random() * 100)::int,
random() * 1000
FROM generate_series(1, 1000000);
-- Create indexes
CREATE INDEX idx_trgm ON benchmark
USING gin(name gin_trgm_ops, description gin_trgm_ops);
CREATE INDEX idx_biscuit ON benchmark
USING biscuit(name, description, category);
ANALYZE benchmark;
Run Benchmarks
-- Single column, simple pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE name LIKE '%abc%' LIMIT 100;
-- Multi-column, complex pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark
WHERE name LIKE '%a%b'
AND description LIKE '%bc%cd%'
ORDER BY score DESC
LIMIT 10;
-- Aggregate query (COUNT)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM benchmark
WHERE name LIKE 'a%l%'
AND category LIKE 'f%d';
-- Complex multi-part pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark
WHERE description LIKE 'u%dc%x'
LIMIT 50;
View Index Statistics
-- Show internal statistics
SELECT biscuit_index_stats('idx_biscuit'::regclass);
Output:
----------------------------------------------------
Biscuit Index Statistics (FULLY OPTIMIZED) +
========================================== +
Index: idx_biscuit +
Active records: 1000002 +
Total slots: 1000002 +
Free slots: 0 +
Tombstones: 0 +
Max length: 44 +
------------------------ +
CRUD Statistics: +
Inserts: 0 +
Updates: 0 +
Deletes: 0 +
------------------------ +
Active Optimizations: +
✓ 1. Skip wildcard intersections +
✓ 2. Early termination on empty +
✓ 3. Avoid redundant copies +
✓ 4. Optimized single-part patterns +
✓ 5. Skip unnecessary length ops +
✓ 6. TID sorting for sequential I/O +
✓ 7. Batch TID insertion +
✓ 8. Direct bitmap iteration +
✓ 9. Parallel bitmap scan support +
✓ 10. Batch cleanup on threshold +
✓ 11. Skip sorting for bitmap scans (aggregates)+
✓ 12. LIMIT-aware TID collection +
Results
> Build time
| Index | Command | Build Time |
| ———– | —————————— | —————– |
| pg_trgm | CREATE INDEX idx_trgm ... | 20,358.655 ms |
| biscuit | CREATE INDEX idx_biscuit ... | 2,734.310 ms |
Use Cases
1. Full-Text Search Applications
-- E-commerce product search
CREATE INDEX idx_products ON products
USING biscuit(name, brand, description);
SELECT * FROM products
WHERE name LIKE '%laptop%'
AND brand LIKE 'ABC%'
AND description LIKE '%gaming%'
ORDER BY price DESC
LIMIT 20;
2. Log Analysis
-- Search error logs
CREATE INDEX idx_logs ON logs
USING biscuit(message, source, level);
SELECT * FROM logs
WHERE message LIKE '%ERROR%connection%timeout%'
AND source LIKE 'api.%'
AND timestamp > NOW() - INTERVAL '1 hour'
LIMIT 100;
3. Customer Support / CRM
-- Search tickets by multiple fields
CREATE INDEX idx_tickets ON tickets
USING biscuit(subject, description, customer_name);
SELECT * FROM tickets
WHERE subject LIKE '%refund%'
AND customer_name LIKE 'John%'
AND status = 'open'
ORDER BY created_at DESC;
4. Code Search / Documentation
-- Search code repositories
CREATE INDEX idx_files ON code_files
USING biscuit(filename, content, author);
SELECT * FROM code_files
WHERE filename LIKE '%.py'
AND content LIKE '%def%parse%json%'
AND author LIKE 'team-%';
5. Analytics with Aggregates
-- Fast COUNT queries (no sorting overhead)
CREATE INDEX idx_events ON events
USING biscuit(event_type, user_agent, referrer);
SELECT COUNT(*) FROM events
WHERE event_type LIKE 'click%'
AND user_agent LIKE '%Mobile%'
AND referrer LIKE '%google%';
Configuration
Build Options
Enable CRoaring for better performance:
Index Options
Currently, Biscuit doesn’t expose tunable options. All optimizations are automatic.
Limitations and Trade-offs
What Biscuit Does NOT Support
- Regular expressions - Only
LIKE/ILIKEpatterns with%and_ - Case-insensitive by default - Use PostgreSQL’s
ILIKEfor case-insensitive - Locale-specific collations - String comparisons are byte-based
- Amcanorder = false - Cannot provide ordered scans directly (but see below)
ORDER BY + LIMIT Behavior
Biscuit doesn’t support ordered index scans (amcanorder = false), BUT:
PostgreSQL’s planner handles this efficiently:
SELECT * FROM table WHERE col LIKE '%pattern%' ORDER BY score LIMIT 10;
Execution plan:
Limit
-> Sort (cheap, small result set)
-> Biscuit Index Scan (fast filtering)
Why this works:
- Biscuit filters candidates extremely fast
- Result set is small after filtering
- Sorting 100-1000 rows in memory is negligible (<1ms)
- Net result: Still much faster than pg_trgm with recheck overhead
Memory Usage
Biscuit stores bitmaps in memory:
- Use
REINDEXto rebuild if index grows too large
Write Performance
- INSERT: Similar to B-tree (must update bitmaps)
- UPDATE: Two operations (remove old, insert new)
- DELETE: Marks as tombstone, batch cleanup at threshold
Comparison with pg_trgm
| Feature | Biscuit | pg_trgm (GIN) | |–––––––––––––|———————–|–––––––––––| | Wildcard patterns | ✔ Native, exact | ✔ Approximate | | Recheck overhead | ✔ None (deterministic) | ✗ Always required | | Multi-column | ✔ Optimized | ⚠️ Via btree_gist | | Aggregate queries | ✔ Optimized | ✗ Same cost | | ORDER BY + LIMIT | ✔ Works well | ✔ Ordered scans | | Memory usage | ⚠️ Higher | ✔ Lower | | Regex support | ✗ No | ✔ Yes | | Similarity search | ✗ No | ✔ Yes |
When to use Biscuit:
- ✔ Wildcard-heavy
LIKEqueries (%,_) - ✔ Multi-column pattern matching
- ✔ Need exact results (no false positives)
- ✔
COUNT(*)/ aggregate queries - ✔ High query volume, can afford memory
When to use pg_trgm:
- ✔ Fuzzy/similarity search (
word <-> pattern) - ✔ Regular expressions
- ✔ Memory-constrained environments
- ✔ Write-heavy workloads
Development
Build from Source
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit
# Development build with debug symbols
make clean
CFLAGS="-g -O0 -DDEBUG" make
# Run tests
make installcheck
# Install
sudo make install
Testing
# Unit tests
make installcheck
# Manual testing
psql -d testdb
CREATE EXTENSION biscuit;
-- Create test table
CREATE TABLE test (id SERIAL, name TEXT);
INSERT INTO test (name) VALUES ('hello'), ('world'), ('test');
-- Create index
CREATE INDEX idx_test ON test USING biscuit(name);
-- Test queries
EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '%ell%';
Debugging
Enable PostgreSQL debug logging:
SET client_min_messages = DEBUG1;
SET log_min_messages = DEBUG1;
-- Now run queries to see Biscuit's internal logs
SELECT * FROM test WHERE name LIKE '%pattern%';
Architecture Details
Index Structure
BiscuitIndex
├── num_columns: int
├── column_indices[]: ColumnIndex[]
│ ├── pos_idx[256]: CharIndex // Forward position bitmaps
│ │ └── entries[]: PosEntry[]
│ │ ├── pos: int
│ │ └── bitmap: RoaringBitmap
│ ├── neg_idx[256]: CharIndex // Backward position bitmaps
│ ├── char_cache[256]: RoaringBitmap // Character existence
│ ├── length_bitmaps[]: RoaringBitmap[] // Exact lengths
│ └── length_ge_bitmaps[]: RoaringBitmap[] // Min lengths
├── tids[]: ItemPointerData[] // Record TIDs
├── column_data_cache[][]: char** // Cached string data
└── tombstones: RoaringBitmap // Deleted records
Query Execution Flow
1. biscuit_rescan()
├─> Parse LIKE pattern into parts
├─> Analyze pattern selectivity (multi-column)
├─> Reorder predicates by priority
└─> For each predicate:
├─> biscuit_query_column_pattern()
│ ├─> Check fast paths (empty, %, pure wildcards)
│ ├─> Match pattern parts using bitmaps
│ └─> Return candidate bitmap
└─> Intersect with previous candidates
2. biscuit_collect_tids_optimized()
├─> Detect aggregate vs. regular query
├─> Estimate LIMIT hint
├─> Collect TIDs from final bitmap
├─> Sort if needed (skip for aggregates)
└─> Apply LIMIT early termination
3. biscuit_gettuple() or biscuit_getbitmap()
└─> Return results to PostgreSQL executor
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing) - Make your changes with tests
- Submit a pull request
Areas for Contribution
- [ ] Add
ILIKEsupport (case-insensitive) - [ ] Implement
amcanorderfor native sorted scans - [ ] Add statistics collection for better cost estimation
- [ ] Support for more data types (JSON, arrays)
- [ ] Parallel index build
- [ ] Index compression options
License
MIT License - See LICENSE file for details.
Author
Sivaprasad Murali
- Email: sivaprasad.off@gmail.com
- GitHub: @Crystallinecore
Acknowledgments
- PostgreSQL community for the extensible index AM framework
- CRoaring library for efficient bitmap operations
- Inspired by the need for faster LIKE query performance in production systems
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Happy pattern matching! Grab a biscuit 🍪 when pg_trgm feels half-baked!