kham_pg 0.6.0

This Release
kham_pg 0.6.0
Date
Status
Stable
Latest Stable
kham_pg 0.8.2 —
Other Releases
Abstract
Thai word-segmentation FTS parser — tsvector, soundex, RTGS romanization, NER
Description
kham_pg is a PostgreSQL text-search parser for the Thai language. Thai has no spaces between words, so standard PostgreSQL parsers produce incorrect token boundaries. kham_pg uses the kham newmm segmentation engine to split Thai text correctly, then expands each token into up to three lexemes at the same tsvector position: the normalised word, its lk82 Thai Soundex code (phonetic-fuzzy search), and its RTGS romanization (Latin-script search). Named entities (persons, places, organisations) are tagged automatically. Supports to_tsvector, plainto_tsquery, ts_rank, ts_headline, and GIN/GiST indexes. Tested on PostgreSQL 14–18.
Released By
nix
License
MIT
Resources
Special Files
Tags

Extensions

kham_pg 0.6.0
Thai word-segmentation FTS parser — tsvector, soundex, RTGS romanization, NER

README

kham_pg

PostgreSQL text-search extension for the Thai language. Provides a custom parser, phonetic dictionary, and ready-to-use FTS configuration so Thai documents can be indexed and queried with tsvector / tsquery.

Thai has no spaces between words. Standard PostgreSQL parsers treat an entire Thai sentence as one token. kham_pg uses the kham newmm segmentation engine to split Thai text into correct word boundaries, then expands each token into up to three lexemes at the same tsvector position:

  1. The normalised word itself
  2. Its lk82 Thai Soundex code — enables phonetic-fuzzy search
  3. Its RTGS romanization — enables Latin-script search

Named entities (persons, places, organisations) are tagged automatically.

Install

Try Option 1 first. Pre-built binaries require no compiler or Rust toolchain. Fall back to Option 2 only if a pre-built binary is not available for your platform or PostgreSQL version.


Option 1 — Pre-built binary (recommended, no Rust required)

Pre-compiled .so files are available for Linux x86_64 and Linux aarch64 (AWS Graviton, Ampere) for PostgreSQL 14–18 on the GitHub Releases page.

Prerequisites

Requirement Notes
PostgreSQL 14–18 Server must be installed; pg_config must be in PATH
Linux x86_64 or aarch64 Pre-built binaries are Linux-only

Steps

# 1. Unzip the PGXN distribution (provides control + SQL files)
unzip kham_pg-0.6.0.zip
cd kham_pg-0.6.0

# 2. Download the pre-built .so for your PG version and architecture
#    Replace PG=17 and ARCH=x86_64 as needed (14–18, x86_64 or aarch64)
PG=17
ARCH=x86_64
VERSION=0.6.0
curl -fsSL \
  "https://github.com/preedep/kham/releases/download/v${VERSION}/kham-pg-v${VERSION}-pg${PG}-${ARCH}-unknown-linux-gnu.tar.gz" \
  | tar xz   # extracts libkham_pg.so

# 3. Install the .so and extension files (sudo required for system PG)
PG_CONFIG=/usr/lib/postgresql/${PG}/bin/pg_config
sudo install -m 755 libkham_pg.so          "$($PG_CONFIG --pkglibdir)/kham_pg.so"
sudo install -m 644 kham_pg.control        "$($PG_CONFIG --sharedir)/extension/"
sudo install -m 644 sql/kham_pg--${VERSION}.sql "$($PG_CONFIG --sharedir)/extension/"

# 4. Load the extension in psql
psql -c "CREATE EXTENSION kham_pg;"

Option 2 — Build from source (fallback)

Use this path if no pre-built binary is available for your platform or PostgreSQL version. Supports any platform where PostgreSQL and Rust are available (Linux, macOS).

Prerequisites

Requirement Notes
PostgreSQL 14–18 pg_config must be in PATH or set via PG_CONFIG env var
Rust 1.85+ Install via rustup.rs
C compiler clang or gcc
Linux system packages build-essential postgresql-server-dev-N (replace N with your PG major version)
brew install gettext macOS only — PostgreSQL headers require libintl.h

Steps

Linux (Debian / Ubuntu)

# Replace 17 with your PostgreSQL major version (14–18)
PG=17
sudo apt-get install -y build-essential postgresql-server-dev-${PG}

unzip kham_pg-0.6.0.zip
cd kham_pg-0.6.0
PG_CONFIG=/usr/lib/postgresql/${PG}/bin/pg_config make install
psql -c "CREATE EXTENSION kham_pg;"

macOS (Homebrew)

# Replace 17 with your PostgreSQL major version (14–18)
PG=17
brew install postgresql@${PG} gettext

unzip kham_pg-0.6.0.zip
cd kham_pg-0.6.0
PG_CONFIG=$(brew --prefix postgresql@${PG})/bin/pg_config make install
psql -c "CREATE EXTENSION kham_pg;"

Token types

SELECT * FROM ts_token_type('kham');
--  1  thai    Thai word
--  2  latin   Latin script token
--  3  number  Numeric token
--  4  punct   Punctuation
--  5  emoji   Emoji token
--  6  unknown Unknown / OOV token
--  7  named   Named entity token (person, place, organisation)

Basic usage

-- Inspect how the parser splits Thai text
SELECT * FROM ts_parse('kham', 'กินข้าวกับปลา');
--  1  กินข้าว
--  1  กับ
--  1  ปลา

-- Build a tsvector — Thai tokens expand to [word, soundex, rtgs]
SELECT to_tsvector('kham', 'กินข้าวกับปลา');
-- '1400':2 '1619':1 '4800':3 'kap':2 'pla':3 'กับ':2 'กินข้าว':1 'ปลา':3

-- Full-text search
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา');

Phonetic search

Thai/Named tokens are automatically expanded with their lk82 Soundex code. Near-homophones share a code and match each other without any extra schema work.

-- Match any word with the same lk82 code as ปลา (4800)
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ to_tsquery('kham', '4800');

RTGS romanization search

Thai/Named tokens are also expanded with their RTGS romanized form. Latin-script queries match Thai documents automatically.

SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'pla');
-- matches documents containing ปลา

Named entity search

SELECT * FROM ts_parse('kham', 'ทักษิณเดินทางไปกรุงเทพ');
--  7  ทักษิณ    ← Named: Person
--  1  เดิน
--  1  ทาง
--  1  ไป
--  7  กรุงเทพ   ← Named: Place

ts_headline

SELECT ts_headline('kham', body, plainto_tsquery('kham', 'ปลา'))
FROM articles;
-- …กิน<b>ปลา</b>กับข้าว…

-- Custom markers
SELECT ts_headline(
    'kham', body,
    plainto_tsquery('kham', 'ปลา'),
    'StartSel=<<<, StopSel=>>>'
) FROM articles;

GIN index

CREATE INDEX articles_fts_idx ON articles
    USING GIN (to_tsvector('kham', body));

-- Query uses the index automatically
SELECT title FROM articles
WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา')
ORDER BY ts_rank(to_tsvector('kham', body), plainto_tsquery('kham', 'ปลา')) DESC;

Upgrade

If you are upgrading from a previous version:

ALTER EXTENSION kham_pg UPDATE;

License

MIT OR Apache-2.0