Repository Guidelines

Repository Guidelines

What This Is

Kazakh stemmer for PostgreSQL full-text search. BFS suffix-stripping over ordered morphological layers (noun: DERIV→PLUR→POSS→CASE→PRED, verb: VVOICE→VNEG→VTENSE→VPERSON) with vowel harmony enforcement, penalty-based candidate scoring, optional lexicon verification, and morphophonological stem repair. ~80.88% token coverage.

No prior Kazakh stemmer exists for PostgreSQL or Elasticsearch. This is the first.

Architecture

Cargo workspace with a shared core library and multiple consumers:

core/ — kazsearch-core: pure Rust stemmer (BFS engine, suffix rules, vowel harmony, penalty scoring, lexicon, stem repair). No Postgres dependencies.
pg_ext/ — pg_kazsearch: pgrx-based PostgreSQL extension. Thin wrapper that calls kazsearch_core::stem().
cli/ — kazsearch-cli: CLI tool (kazsearch stem, analyze, bench, lexicon validate).
elastic/ — kazsearch-elastic: Elasticsearch plugin (placeholder).
legacy/pg_kazsearch_c/ — archived original C implementation (reference only, not built).

Key core modules:

core/src/explore.rs — BFS engine, visit set, penalty scorer, stem repair. The heart.
core/src/text.rs — UTF-8 iteration, vowel classification, harmony checks.
core/src/rules.rs — Suffix tables for noun and verb layers.
core/src/lexicon.rs — Lexicon loader.
core/src/lib.rs — stem() entry point, winner selection, sound change undo.

Supporting: scripts/ (lexicon builder), eval/ (scraper, corpus loader, evaluator, CMA-ES optimizer), docker/ (dev container).

Commands

All via just.

Command	What it does
`just up` / `just down`	Start/stop Postgres container
`just build`	Build lexicon + compile Rust extension + install
`just reload`	Build + DROP/CREATE EXTENSION
`just cli`	Build CLI tool
`just test-core`	Run core library unit tests
`just test-ext`	Smoke-test stemmer output via SQL
`just psql`	Interactive psql
`just pipeline`	Full eval: scrape → load → gen queries → evaluate
`just optimize`	CMA-ES penalty weight optimization
`just apply-weights`	Push optimized weights to running DB

Style

Rust: Standard rustfmt. Public API in core/src/lib.rs. Modules mirror the C design: explore, text, rules, lexicon.

Python: snake_case, standalone argparse CLIs.

Commits: Conventional Commits (feat:, fix:, refactor:). One logical change per commit. just build && just test-ext must pass.

Critical Context

Kazakh is agglutinative — words stack 5-6 suffixes. Greedy stripping fails; BFS is necessary.
Vowel harmony (back/front) is mandatory for suffix validation. Glides (у, и, ю) are transparent.
Penalty constants in candidate_penalty (core/src/explore.rs) are empirically tuned via CMA-ES against a real corpus. Changing one can break others.
Stem repair reverses morphophonological changes: consonant mutation (б→п, ғ→қ, г→к), vowel elision, and lexicon-based vowel restore.
The lexicon safety valve prevents overstemming: if the input word is already in the dictionary and the candidate looks suspicious, return input unchanged.
Layer guards in core/src/explore.rs encode real morphotactic constraints — they are not optional and each one prevents a class of mis-stems.

Issue Tracking with bd (beads)

IMPORTANT: This project uses bd (beads) for ALL issue tracking. Do NOT use markdown TODOs, task lists, or other tracking methods.

Why bd?

Dependency-aware: Track blockers and relationships between issues
Git-friendly: Dolt-powered version control with native sync
Agent-optimized: JSON output, ready work detection, discovered-from links
Prevents duplicate tracking systems and confusion

Quick Start

Check for ready work:

bd ready --json

Create new issues:

bd create "Issue title" --description="Detailed context" -t bug|feature|task -p 0-4 --json
bd create "Issue title" --description="What this issue is about" -p 1 --deps discovered-from:bd-123 --json

# Use stdin for descriptions with special characters (backticks, !, nested quotes)
echo 'Description with `backticks` and "quotes"' | bd create "Title" --description=- --json

Claim and update:

bd update <id> --claim --json
bd update bd-42 --priority 1 --json

Complete work:

bd close bd-42 --reason "Completed" --json

Issue Types

bug - Something broken
feature - New functionality
task - Work item (tests, docs, refactoring)
epic - Large feature with subtasks
chore - Maintenance (dependencies, tooling)

Priorities

0 - Critical (security, data loss, broken builds)
1 - High (major features, important bugs)
2 - Medium (default, nice-to-have)
3 - Low (polish, optimization)
4 - Backlog (future ideas)

Workflow for AI Agents

Check ready work: bd ready shows unblocked issues
Claim your task atomically: bd update <id> --claim
Work on it: Implement, test, document
Discover new work? Create linked issue:
- bd create "Found bug" --description="Details about what was found" -p 1 --deps discovered-from:<parent-id>
Complete: bd close <id> --reason "Done"

Auto-Sync

bd automatically syncs via Dolt:

Each write auto-commits to Dolt history
Use bd dolt push / bd dolt pull for remote sync
No manual export/import needed!

Important Rules

Use bd for ALL task tracking
Always use --json flag for programmatic use
Link discovered work with discovered-from dependencies
Check bd ready before asking “what should I work on?”
Do NOT create markdown TODO lists
Do NOT use external issue trackers
Do NOT duplicate tracking systems

Landing the Plane (Session Completion)

When ending a work session, you MUST complete ALL steps below. Work is NOT complete until git push succeeds.

MANDATORY WORKFLOW:

File issues for remaining work - Create issues for anything that needs follow-up
Run quality gates (if code changed) - Tests, linters, builds
Update issue status - Close finished work, update in-progress items
PUSH TO REMOTE - This is MANDATORY: bash git pull --rebase bd dolt push git push git status # MUST show "up to date with origin"
Clean up - Clear stashes, prune remote branches
Verify - All changes committed AND pushed
Hand off - Provide context for next session

CRITICAL RULES: - Work is NOT complete until git push succeeds - NEVER stop before pushing - that leaves work stranded locally - NEVER say “ready to push when you are” - YOU must push - If push fails, resolve and retry until it succeeds

PGXN

PostgreSQL Extension Network

Contents